1. Summary

The  investigators  developed  and  explored  new  algorithms  for  multi-armed  bandits  with 
periodic  fluctuations,  multi-armed  bandits  with  budgets,  and  for  transfer  learning  in  Markov 
Decision  Processes  (MDPs).  In  the  periodic  fluctuations  case,  they  developed  new  theory  and 
promising  experimental  results.  In  the  budgeted  case,  there  are  promising  experimental 
results  but  no  new  theory.  In  the  MDP  case,  existing  theory  was  validated  on  small  problems 
but  did  not  generalize  well  to  larger  problems  due  to  computational  limitations. 


2.  Introduction 

This  project  involved  three  thrusts: 

1. New algorithms for multi-armed bandits (Rudin).

2.  New  algorithms  for  multi-armed  bandits  in  the  presence  of  costs  and  a  budget 
(Munagala). 

3.  Evaluation  of  transfer  learning  algorithms  for  Probably  Approximately  Correct  (PAC)- 
optimal  learning  in  MDPs  (Parr). 

The  three  investigators  explored  each  of  these  questions  and  produced  some  new 
algorithms  and  promising  experimental  results,  as  described  in  the  subsequent  sections  and 
attached  documents. 

2.1  Multi-armed  Bandits 

Multi-armed bandits abstract a basic learning question in which the learner must estimate the
payoff  or  reward  that  results  from  making  different  choices  (abstracted  as  arms  on  different 
slot  machines).  The  payoffs  are  stochastic,  so  a  single  pull  is  not  sufficient  to  estimate  the 
payoff. The learner must use a strategy that balances exploration (learning new information
by  pulling  arms  that  might  not  have  had  the  highest  payoff  so  far)  with  exploitation  (pulling 
the  arm  that  is  currently  estimated  to  have  the  highest  payoff). 

Rudin's team observed that payoffs often have periodic fluctuations that can be
exploited, a case that was not addressed by the existing literature. She developed new algorithms
and results, summarized in the next section and discussed in detail in Appendix A1.

2.2  Budgeted  Multi-Armed  Bandits 

Budgeted multi-armed bandits generalize the standard multi-armed bandit
problem to the case where arms have an associated cost and there is a fixed budget for
exploration. This is, arguably, a more realistic scenario in which there is a variable cost
associated with exploring various options and this cost must be taken into account.

Munagala's team developed a new algorithm for the budgeted case and showed
that this algorithm performed better than the natural alternatives in simulations. This is
summarized in the next section, and greater detail is provided in Appendix A2.

2.3 PAC-optimal Transfer Learning in MDPs

PAC-optimal  learning  in  MDPs  involves  learning  how  to  perform  near  optimally  in  an  MDP 
while  making  a  bounded  number  of  steps  that  are  significantly  suboptimal.  Existing  work  in 
this  area  had  very  limited  ability  to  transfer  knowledge  from  one  problem  to  a  related 
problem. 

In unpublished previous work (see Appendix A3), Parr and coauthors developed new
algorithms  for  transfer  learning  that  included  the  use  of  a  transfer  function  to  transfer 
knowledge  between  MDPs.  This  allowed  new  PAC-optimality  bounds.  However,  it  was  not 
clear  if  the  theory  would  be  useful  in  practice. 

Parr’s  team  implemented  and  tested  the  above  approach.  The  results  are  summarized  in  the 
next  section  and  described  in  detail  in  Appendix  A4.  In  addition,  they  explored  the  use  of 
various  manifold  alignment  and  Generative  Adversarial  Network  (GAN)  techniques  to 
achieve  practical  transfer  learning  results. 


3.  Methods,  Assumptions  and  Procedures 

3.1  Bandit  Problems 

Standard  approaches  to  bandit  problems  involve  some  attempt  to  balance  exploration  with 
exploitation. The techniques explored in this project involve generalizations of the three main
approaches  described  below: 

3.1.1 ε-greedy

The ε-greedy approach picks the arm that looks best with probability 1 − ε, and picks a
random arm with probability ε. The parameter ε may be chosen adaptively.
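
The selection rule just described can be sketched in a few lines of Python. This is only an illustration: the function name `epsilon_greedy_choice` and its arguments are assumptions made here, not part of the project's code.

```python
import random

def epsilon_greedy_choice(mean_estimates, epsilon, rng=random):
    """Return the index of the arm to pull under the epsilon-greedy rule.

    mean_estimates: current estimated payoff of each arm (hypothetical input).
    epsilon: probability of exploring a uniformly random arm.
    """
    if rng.random() < epsilon:
        # Explore: pick any arm uniformly at random.
        return rng.randrange(len(mean_estimates))
    # Exploit: pick the arm with the highest estimated payoff.
    return max(range(len(mean_estimates)), key=lambda j: mean_estimates[j])
```

With `epsilon = 0` the rule is purely greedy; an adaptive scheme would pass a value of `epsilon` that changes from round to round.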

3.1.2  UCB 

The  Upper  Confidence  Bound  (UCB)  approach  estimates  the  payoff  for  each  arm  within  a 
confidence  interval  and  adds  a  decaying  (with  experience)  exploration  bonus  to  ensure  that 
even  suboptimal  arms  are  continually  tested  -  albeit  decreasingly  with  time. 
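
As a rough illustration of an exploration bonus that decays with experience, the sketch below uses the index of the well-known UCB1 variant (estimated mean plus sqrt(2 ln t / pulls)). Choosing UCB1 specifically is an assumption made for this example; the report does not say which UCB variant was generalized, and the function name is hypothetical.

```python
import math

def ucb1_choice(mean_estimates, pull_counts, t):
    """Return the arm with the largest UCB1 index at round t (t >= 1)."""
    # Any arm that has never been pulled is tried first.
    for j, count in enumerate(pull_counts):
        if count == 0:
            return j

    def index(j):
        # The bonus shrinks as arm j accumulates pulls.
        bonus = math.sqrt(2.0 * math.log(t) / pull_counts[j])
        return mean_estimates[j] + bonus

    return max(range(len(mean_estimates)), key=index)
```

An arm with few pulls keeps a large bonus, so it is still selected occasionally even when its estimated mean is low.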

3.1.3  Thompson  Sampling 

Thompson  sampling  is  a  Bayesian  approach  that  maintains  a  distribution  over  payoffs. 
Models  are  sampled  from  this  distribution  and  the  distribution  is  updated  based  upon  the 
results. As with UCB, this ensures that arms will be tried with some positive probability
throughout  learning. 
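
A minimal sketch of Thompson sampling follows, assuming binary (0/1) payoffs and a Beta(1, 1) prior on each arm. Both are assumptions chosen for illustration; the report does not specify the reward model, and the function names are hypothetical.

```python
import random

def thompson_choice(successes, failures, rng=random):
    """Sample a payoff rate for each arm from its Beta posterior; play the best sample."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda j: samples[j])

def thompson_update(successes, failures, arm, reward):
    """Update the posterior counts of the played arm with a 0/1 reward."""
    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1
```

Because every arm's posterior keeps some mass on high payoff rates, each arm retains a positive probability of being played.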

3.2  PAC-optimal  exploration  in  MDPs,  and  transfer  in  MDPs 

Similar  to  the  UCB  algorithm  mentioned  above,  most  PAC-optimal  algorithms  for  MDPs 
involve  some  form  of  exploration  bonus.  Exploration  in  continuous  state  spaces  typically 
involves  some  form  of  state  aggregation  so  that  similar  states  are  clustered  together  to  create 
a  piecewise-constant  value  function.  This  was  done  in  Parr’s  previous  work. 

Transfer  learning  in  MDPs  (and  reinforcement  learning)  can  take  many  forms,  though 
there  is  very  little  work  on  sample  complexity  for  this  case.  Parr’s  work  involves  a  user- 
provided  transfer  function  that  shows  how  samples  from  one  MDP  can  be  adapted  to  another 
MDP. 


Approved  for  Public  Release;  Distribution  Unlimited 


4.  Results  and  Discussion 

4.1  Bandits  with  Fluctuations 

Rudin’s  team  developed  several  new  algorithms  for  this  case.  She  introduced  a  greed 
parameter  that  could  be  regulated  over  time.  The  regulation  of  greed  is  done  in  response  to  a 
known, external signal that acts as a payoff multiplier. For example, a retailer may evaluate
various  strategies  for  reaching  consumers  at  different  times  of  the  year,  but  during  the 
December  holiday  season,  the  payoffs  for  various  strategies  could  be  multiplied  because  of 
increased  consumer  spending.  The  retailer  may  have  a  good  estimate  of  the  multiplier  but  not 
the  base  payoffs. 

Rudin  considered  three  variations  on  existing  work: 

1.  A  variable  arm  pool  approach  that  limits  or  expands  the  number  of  arms  considered 
based  upon  the  payoff  multiplier. 

2. Variations on the ε-greedy approach that take the multiplier into account.

3.  Variations  on  the  UCB  algorithm  that  take  the  multiplier  into  account. 

In  each  of  the  above  cases,  Rudin  proved  regret  bounds  for  the  modified  algorithms.  In 
addition, experimental results showed that these algorithms performed favorably in comparison
to the standard algorithms in the presence of periodic payoff multipliers.

4.2  Budgeted  Bandits 

Munagala’s  team  modified  the  Thompson  sampling  approach  for  multi-armed  bandits  to  the 
budgeted  cost  case.  Experimentally,  this  approach  was  shown  to  be  superior  to  existing 
algorithms  for  the  budgeted  case,  though  he  was  not  able  to  obtain  any  positive  theoretical 
results  to  support  the  experimental  observations. 

In  addition,  Munagala’s  team  showed  theoretically  that  no  algorithm  could  be  expected  to 
have reasonable performance against an adversary that chose arm costs in an adversarial
manner. Allowing an adversary to set costs implies a very strong adversary, so a weaker
model that allowed costs to change slowly was considered and a new algorithm was
proposed, though it did not perform consistently better than other approaches.

Munagala  also  considered  a  generalization  of  the  Thompson  sampling  algorithm  to 
contextual  bandits,  though  the  experimental  results  were  not  promising.  In  the  final  phase  of 
the  project  (not  detailed  in  the  appendices),  the  team  considered  a  contextual  bandit  case 
where there is a global context that multiplies the payoffs (similar to the case considered
by  Rudin).  In  this  case,  the  modified  version  of  Thompson  sampling  did  provide  some 
advantage  over  existing  algorithms. 


4.3  Transfer  in  MDPs 

Parr’s  team  conducted  experiments  to  evaluate  the  benefit  of  using  a  transfer  function  to 
improve transfer learning in MDPs. The transfer function describes how samples from one
MDP can be transformed to act like samples in another MDP (or a related part of the same
MDP). For example, in a single MDP that exhibits some sort of symmetry, samples from
one quadrant of the state space could be transformed to act like samples from another part of
the state space by flipping the signs of the state variables, effectively doubling the number of
samples and reducing the sample complexity by half.
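
The symmetry example can be made concrete with a small sketch. It assumes each sample is a tuple (state, action, reward, next_state), that the MDP is symmetric under negating both state and action, and that the reward is unchanged by the flip. These are assumptions for illustration only; the names `mirror_sample` and `augment_with_symmetry` are hypothetical and are not taken from Parr's implementation.

```python
def mirror_sample(sample):
    """Apply a sign-flip transfer function to one (s, a, r, s') sample."""
    state, action, reward, next_state = sample
    flipped_state = tuple(-x for x in state)
    flipped_next = tuple(-x for x in next_state)
    # Assumption: the dynamics are odd-symmetric, so the action flips too,
    # and the reward is the same in the mirrored situation.
    return (flipped_state, -action, reward, flipped_next)

def augment_with_symmetry(samples):
    """Return the original samples plus their mirrored counterparts."""
    return list(samples) + [mirror_sample(s) for s in samples]
```

Each observed sample yields one additional synthetic sample, which is the sense in which the number of samples is doubled.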

Experimental results for a simple problem like the classic inverted pendulum benchmark
showed  that  the  expected  reduction  in  the  number  of  samples  required  did  indeed  occur. 
Unfortunately,  it  proved  difficult  to  find  interesting  examples  that  satisfied  the  assumptions  of 
the  theory  and  that  also  had  easily  described  transfer  functions  that  provided  significant 
reduction  in  sample  complexity. 

Parr's team also considered other approaches to transfer such as the use of GANs. Initial
results, detailed in Appendices A5 and A6, were somewhat promising but, like many GAN
methods, the approach was somewhat unstable. Prospects for extending this approach to more
challenging  problems  are  unclear.  It  is  possible  that  improved  GAN  techniques  that  have  been 
developed  recently  could  help. 


5.  Conclusions 

In  the  case  of  multi-armed  bandits,  a  major  finding  of  this  project  is  that  periodic  noise 
multipliers  can  be  exploited.  This  is  supported  by  theoretical  results  and  experimental  results. 
For  arms  with  budgeted  costs,  variations  on  Thompson  sampling  look  promising,  though 
theoretical  results  supporting  the  experimental  results  are  still  lacking. 

For  transfer  learning  in  MDPs,  the  experimentation  done  supports  the  existing  theory,  but 
existing  algorithms  with  theoretical  guarantees  still  do  not  scale  well  to  large  problems.  Other 
techniques,  such  as  transfer  through  GANs,  show  some  promise,  but  further  work  is  required 
to  improve  the  stability  of  such  approaches  and  demonstrate  that  they  can  work  on  larger  and 
more  challenging  problems. 

Regulating Greed Over Time

STEFANO  TRACA  AND  CYNTHIA  RUDIN 


Abstract 

In retail, there are predictable yet dramatic time-dependent patterns in customer behavior, such as periodic changes in the
number of visitors, or increases in visitors just before major holidays (e.g., Christmas). The current paradigm of multi-armed
bandit  analysis  does  not  take  these  known  patterns  into  account,  which  means  that  despite  the  firm  theoretical  foundation  of 
these  methods,  they  are  fundamentally  flawed  when  it  comes  to  real  applications.  This  work  provides  a  remedy  that  takes  the 
time-dependent patterns into account, and we show how this remedy is implemented in the UCB and ε-greedy methods. In the
corrected  methods,  exploitation  (greed)  is  regulated  over  time,  so  that  more  exploitation  occurs  during  higher  reward  periods, 
and  more  exploration  occurs  in  periods  of  low  reward.  In  order  to  understand  why  regret  is  reduced  with  the  corrected  methods, 
we  present  a  set  of  bounds  that  provide  insight  into  why  we  would  want  to  exploit  during  periods  of  high  reward,  and  discuss 
the  impact  on  regret.  Our  proposed  methods  have  excellent  performance  in  experiments,  and  were  inspired  by  a  high-scoring 
entry  in  the  Exploration  and  Exploitation  3  contest  using  data  from  Yahoo!  Front  Page.  That  entry  heavily  used  time-series 
methods  to  regulate  greed  over  time,  which  was  substantially  more  effective  than  other  contextual  bandit  methods. 
Keywords: Multi-armed bandit, exploration-exploitation trade-off, time series, retail management, marketing, online applications,
regret bounds.


1  Introduction 

Consider  the  classic  pricing  problem  faced  by  retailers,  where  the  price  of  a  new  product  on  a  given  day  is  chosen  to  maximize 
the  expected  profit.  The  optimal  price  is  learned  asymptotically  through  a  mix  of  exploring  various  pricing  choices  and 
exploiting  those  known  to  yield  higher  profits,  potentially  through  the  use  of  a  multi-armed  bandit  (MAB).  We  assume  the 
retailer  knows  the  daily  trend  of  the  number  of  customers  over  time  that  visit  the  store.  This  information  can  be  leveraged  in 
order to produce a better exploration/exploitation scheme. For instance, if we know that many customers will come to the store in
the week before Christmas, we would not want to explore new prices on those days. We might even stop exploring altogether.
Our setting violates the classic assumption, typically made in multi-armed bandits, of random rewards with a static probability
distribution. This is because rewards are correlated through the trends in customer behavior. If one uses a standard
MAB algorithm in the case where trends are dramatic, the result could be arbitrarily bad. A simple example is the case where
the number of customers at the store will have a predictably large spike on a given day (e.g., for Boxing Day in England, shown
in Figure 1a), where the classic MAB algorithm could choose a poor price on that particular day for the purpose of exploration.

For  retailers,  there  are  almost  always  clear  trends  in  customer  arrivals,  and  they  are  often  periodic  or  otherwise  predictable. 
Some examples are in Figures 1b and 1c. These dramatic trends might have a substantial impact on which policy we would use
to price products. The main contributions of this work are (i) a new framework that illustrates when it is beneficial to stop
exploration at times in favor of exploitation; (ii) novel algorithms that show how to adapt existing policies to regulate greed
over time. These are: Algorithm 2: ε-greedy algorithm with regulating threshold (Section 3.3); Algorithm 3: soft ε-greedy
algorithm (Section 3.4); Algorithm 4: UCB algorithm with regulating threshold (Section 3.5); and Algorithm 5: soft UCB
algorithm (Section 3.6); (iii) theoretical regret bounds for the above algorithms; (iv) numerical comparisons (in Appendix
4). We compare to “smarter” versions of the classic ε-greedy algorithm (Algorithm 7) and UCB algorithm (Algorithm 6).
The standard algorithms incorrectly estimate the mean rewards of the arms. The “smarter” versions fix that issue, and thus
are a reasonable baseline to compare with. The “smarter” algorithms do not regulate greed over time, however, and are not
comparable in performance to the algorithms that do this regulation.

Figure 1: (a) English users shopping online. Source: ispreview.co.uk. (b) Google searches for “strawberries”. Source: Google
Trends. (c) Google searches for “scarf”. Source: Google Trends.

In our setting, the behavioral information about customers is distilled so that it takes the form of a reward multiplier G(t),
where we assume G(t) is known or can be well-estimated before the decision is made at time t. G(t) should be thought of as
the number of customers in the store on day t. If G(t) is not known but can be well-approximated, the regret bounds weaken
accordingly. The new algorithms, which take advantage of knowing G(t), are not a simple extension of the ε-greedy algorithm
and the UCB algorithm. They anticipate the number of customers and choose how much exploration to allow at that timestep.

As  a  result  of  the  reward  multiplier  function,  theoretical  regret  bound  analysis  of  the  multi-armed  bandit  problem  becomes 
more complicated, because now the distribution of rewards depends explicitly on time. We care not only about how many times each
suboptimal arm is played, but also about exactly when it is played. For instance, if suboptimal arms are played only when the reward
multiplier  is  low,  intuitively  it  should  not  hurt  the  overall  regret. 

2  Related  Work 

The setup of this work differs from other works considering time-dependent multi-armed bandit problems: we do not assume
the  mean  rewards  of  the  arms  exhibit  random  changes  over  time,  and  we  assume  that  the  reward  multiplier  is  known  in  advance. 
Other  works  consider  different  scenarios  where  reward  distributions  can  change  over  time,  but  in  a  way  that  is  not  known  in 
advance. For these settings, the algorithm needs to compensate for changes in the reward distribution after the change, rather
than altering its strategy in advance of the change. Along these lines, [Liu et al., 2013] consider a problem where each arm
transitions  in  an  unknown  Markovian  way  to  a  different  reward  state  when  it  is  played,  and  evolves  according  to  an  unknown 
random  process  when  it  is  not  played.  [Garivier  and  Moulines,  2008]  presented  an  analysis  of  a  discounted  version  of  the  UCB 
and  a  sliding  window  version  of  the  UCB,  where  the  distribution  of  rewards  can  have  abrupt  changes  and  stays  stationary  in 
between. [Besbes et al., 2014] consider the case where the mean rewards for each arm can change, where the variation of
that  change  is  bounded.  [Slivkins  and  Upfal,  2007]  consider  an  extreme  case  where  the  rewards  exhibit  Brownian  motion, 
leading  to  regret  bounds  that  scale  differently  than  typical  bounds  (linear  in  T  rather  than  logarithmic).  One  of  the  works  that  is 
relevant to ours is that of [Chakrabarti et al., 2009], who consider “mortal bandits” that disappear or appear.

A  particularly  interesting  setting  is  discussed  by  [Komiyama  et  al.,  2013],  where  there  are  lock-up  periods  when  one  is 
required to play the same arm several times in a row. There is an equivalence of that problem to the one studied here. In our
setting, we fix the price of the product for an entire day. This is equivalent to a setting where timesteps are taken for each
customer, but where there is a lock-up period over the course of the entire day. In other words, in our scenario, the micro-lock-up
periods occur at each step of the game, and their effective lengths are given by G(t). In the work of [Komiyama et al., 2013],
lock-up periods are presented but greed is not regulated based on the size of the lock-up periods.

The  ideas  in  this  paper  were  inspired  by  a  high  scoring  entry  in  the  Exploration  and  Exploitation  3  Phase  1  data  mining 
competition,  where  the  goal  was  to  build  a  better  recommendation  system  for  Yahoo!  Front  Page  news  articles.  At  each  time, 
several  articles  were  available  to  choose  from,  and  these  articles  would  appear  only  for  short  time  periods  and  would  never 
be available again. One of the main ideas in this entry was simple yet effective: if any article gets more than 9 clicks out of
the last 100 times we show the article, keep displaying it until the clickthrough rate goes down. This alone increased the
clickthrough rate by almost a quarter of a percent. In the Yahoo! advertising problem, the high reward period was created
by  the  availability  of  an  article  (an  arm),  which  is  different  than  the  retail  store  case,  but  the  same  effect  is  present,  where 
regulating  the  rate  of  exploitation  (i.e.,  greed)  over  time  is  beneficial  to  overall  rewards.  Here  also,  it  is  useful  to  stop  exploring 
during  times  when  a  function  like  G(t)  is  high.  For  Yahoo!  Front  Page,  articles  have  a  short  lifespan  and  some  articles  are 
much better than others, in which case, if we find a particularly good article, we should exploit by repeatedly showing that one,
and not explore new articles. The framework here distills the problem, allowing us to isolate and study this effect of a
time-dependent function that we can use to regulate greed over time.

3 Algorithms for regulating greed over time

This  section  illustrates  the  problem,  the  proposed  algorithms  to  regulate  greed  over  time,  and  theoretical  results  on  the  bound 
on  the  expected  regret  of  each  policy. 

3.1 Problem setup

Formally,  the  stochastic  multi-armed  bandit  problem  with  regulated  greed  is  a  game  played  in  n  rounds.  At  each  round  t  the 
player  chooses  an  action  among  a  finite  set  of  m  possible  choices  called  arms  (for  example,  they  could  be  ads  shown  on  a 
website, recommended videos and articles, or prices). When arm j is played ($j \in \{1, \ldots, m\}$), an unscaled random reward
$X_j(t)$ is drawn from an unknown distribution and the player receives the scaled reward $X_j(t)G(t)$, where G(t) is the multiplier
function. The distribution of $X_j(t)$ does not change with time (the index t is just used to indicate in which turn the reward was
drawn), while G(t) is a known function of time assumed to be bounded (this is, for instance, the number of searches for a
particular item on Google or the number of users on Skype). At each turn, the player also suffers a possible regret from not
having played the best arm: the mean regret for having played arm j is given by $\Delta_j = \mu_* - \mu_j$, where $\mu_*$ is the mean reward
of the best arm (indicated by “*”) and $\mu_j$ is the mean reward obtained when playing arm j. At the end of each turn the player
can update her estimate of the mean reward of arm j:

$$\hat{X}_j = \frac{1}{T_j(t-1)} \sum_{s=1}^{T_j(t-1)} X_j(s), \qquad (1)$$

where $T_j(t-1)$ is the number of times arm j has been played before round t starts. This update will hopefully help the player
in choosing a good arm in the next round. The total regret at the end of the game is given by

$$R_n = \sum_{t=1}^{n} \sum_{j=1}^{m} G(t)\,\Delta_j\,\mathbb{1}_{\{I_t = j\}}, \qquad (2)$$

where $\mathbb{1}_{\{I_t = j\}}$ is an indicator function equal to 1 if arm j is played at time t (otherwise its value is 0). The strategies presented
in the following sections aim to minimize the expected cumulative regret $E[R_n]$ by regulating exploitation (i.e., greed) of the
best arm found so far, and exploration, based on the values of the multiplier function G(t). In general, when the multiplier
function is high, the player risks incurring high regret if a bad arm is played. We show that it is beneficial to stop exploration
in this situation and resume exploration when rewards and regrets are lower. A complete list of the symbols used throughout
the paper can be found in Appendix E.
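
To see why the timing of exploration matters under a multiplier, the following sketch evaluates the mean regret of a fixed sequence of plays, summing the multiplier times the gap of the arm played at each round. The two-arm gaps and the alternating multiplier are made-up numbers for illustration, and the function name is hypothetical.

```python
def total_regret(arms_played, gaps, multiplier):
    """Sum over rounds of G(t) times the mean-reward gap of the arm played."""
    return sum(multiplier[t] * gaps[arm] for t, arm in enumerate(arms_played))

gaps = [0.0, 0.3]                      # arm 0 is optimal; arm 1 is 0.3 worse
multiplier = [1.0, 10.0, 1.0, 10.0]    # low, high, low, high

explore_when_low = [1, 0, 1, 0]        # suboptimal arm only in low rounds
explore_when_high = [0, 1, 0, 1]       # suboptimal arm only in high rounds

regret_low = total_regret(explore_when_low, gaps, multiplier)    # about 0.6
regret_high = total_regret(explore_when_high, gaps, multiplier)  # about 6.0
```

Both sequences play the suboptimal arm twice, yet the regret differs by a factor of ten because of when those plays occur.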

3.2  Regulating  greed  with  variable  arm  pool 

In Algorithm 1 we present an algorithm that regulates greed by varying the size $m_t$ of the pool of arms that we are allowed to
choose from. When the greed function is high, the pool size $m_t$ shrinks, so that we choose randomly among the arms that performed
best. When the greed function is low we choose randomly among a larger pool. The size of the pool is given by

$$m_t = \min\left(m, \max\left(1, [\ldots]\right)\right). \qquad (3)$$


Algorithm 1: variable pool algorithm

Input: number of rounds n, number of arms m, a constant c > 1, and $\{G(t)\}_{t=1}^{n}$;

for t = m + 1 to n do
    Set pool size to $m_t = \min(m, \max(1, [\ldots]))$;
    Play arm j at random from the pool;
    Get reward $G(t)X_j$. Update $\hat{X}_j$;
end


Let us define

$$\lambda_t = \frac{1}{2} \sum_{s \le t} \mathbb{1}_{\{\min(m, \max(1, [\ldots])) = m\}},$$

which is half of the number of times that the pool contains all the arms up to time t. For the following theorem we require that
$\lambda_t \ge \gamma \log(t)$ for some $\gamma > 5$. If G(t) does not satisfy this requirement, it is easy to construct a new G'(t) from G(t) by setting
G'{1)  =  (c  -  1  )/t  for  t  €  >t- 1-  2m  if  7log(f)  >  p7log(£  -  1)],  otherwise  keep  G'(t)  =  G(<).  The  following 

theorem  provides  a  bound  on  the  regret  after  n  rounds. 

3.3 Regulating greed with threshold in the ε-greedy algorithm

In Algorithm 2 we present a variation of the ε-greedy algorithm of Auer et al. [2002], in which a threshold z has been
introduced in order to regulate greed. An optimal threshold z can also be estimated by running the algorithm on past data
and choosing the value that gives the lowest regret. At each turn t, when the rewards are “high” (i.e., the G(t) multiplier is
above the threshold z), the algorithm exploits the best arm found so far, that is, the arm j with the highest mean estimate given in
equation (1). When the rewards are “low” (i.e., the G(t) multiplier is under the threshold z), the algorithm will explore with
probability $\varepsilon_t = \min\{1, km/\tilde{t}\}$ an arm at random (each arm has probability 1/m of being selected). The number $\tilde{t}$ counts how
many times the multiplier function has been under the threshold up to time t, while the constant k is greater than 10 and such
that $k > 4/\min_j \Delta_j$. The reason for this choice becomes clear from the expression for $\beta_j(\tilde{t})$, which is a bound on the probability
of incorrectly considering a suboptimal arm j to be the best choice. By setting the parameter k accordingly, we can ensure the
logarithmic bound on the expected cumulative regret over the number of rounds (because the $\varepsilon_t$ are $\Theta(1/\tilde{t})$ and their sum over
time is logarithmically bounded, while the $\beta_j(\tilde{t})$ term is $o(1/\tilde{t})$).

Algorithm 2: ε-greedy algorithm with regulating threshold

Input: number of rounds n, number of arms m, threshold z, a constant k > 10 such that $k > 4/\min_j \Delta_j$,
sequences $\{\varepsilon_t\}_{t=1}^{n} = \min\{1, km/\tilde{t}\}$ and $\{G(t)\}_{t=1}^{n}$

Initialization: play all arms once and initialize $\hat{X}_j$ (defined in (1)) for each j = 1, ..., m

for t = m + 1 to n do
    if (G(t) < z) then
        with probability $\varepsilon_t$ play an arm uniformly at random (each arm has probability 1/m of being selected),
        otherwise (with probability $1 - \varepsilon_t$) play arm j such that $\hat{X}_j \ge \hat{X}_i \ \forall i$
    else
        play arm j such that $\hat{X}_j \ge \hat{X}_i \ \forall i$
    end
    Get reward $G(t)X_j$;
    Update $\hat{X}_j$;
end

The following theorem provides a bound on the mean regret of this policy (the proof is given in Appendix A).
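
Theorem 3.2 (ε-greedy algorithm with hard threshold). The bound on the mean regret $E[R_n]$ at time n is given by

$$E[R_n] \le \sum_{j=1}^{m} G(j)\,\Delta_j \qquad (6)$$

$$\quad + \sum_{t=m+1}^{n} G(t)\,\mathbb{1}_{\{G(t) < z\}} \sum_{j:\,\mu_j < \mu_*} \Delta_j \left(\varepsilon_t \frac{1}{m} + (1 - \varepsilon_t)\,\beta_j(\tilde{t})\right) \qquad (7)$$

$$\quad + \sum_{t=m+1}^{n} G(t)\,\mathbb{1}_{\{G(t) \ge z\}} \sum_{j:\,\mu_j < \mu_*} \Delta_j\,\beta_j(\tilde{t}), \qquad (8)$$

where

$$\beta_j(\tilde{t}) = [\ldots] \qquad (9)$$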

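Theorem 3.3 (ε-greedy algorithm with hard threshold). The bound on the mean regret $E[R_n]$ at time n is given by

$$E[R_n] \le O(1) \qquad (10)$$

$$\quad + \sum_{t=m+1}^{n} G(t)\,\mathbb{1}_{\{G(t) < z\}} \left(\Theta\!\left(\frac{1}{\tilde{t}}\right) + o\!\left(\frac{1}{\tilde{t}}\right)\right) \qquad (11)$$

$$\quad + \sum_{t=m+1}^{n} G(t)\,\mathbb{1}_{\{G(t) \ge z\}}\; o\!\left(\frac{1}{\tilde{t}}\right). \qquad (12)$$

Intuitively, this bound is better than the usual ε-greedy bound because when G(t) is low it is multiplied by a quantity
that is of the order $\Theta(1/\tilde{t}) + o(1/\tilde{t})$, while when G(t) is high it is multiplied by an $o(1/\tilde{t})$ quantity.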


The sum in (6) is the exact mean regret during the initialization phase of Algorithm 2. In (7) we have a bound on the
expected regret for turns that present low values of G(t), where the quantity in the parentheses is the bound on the probability
of playing arm j: $\beta_j(\tilde{t})$ is the bound on the probability that arm j is considered to be the best arm at round t, and 1/m is the
probability of choosing arm j when the choice is made at random. Finally, in (8) we have a bound on the expected regret for
turns that present high values of G(t), and in this case we consider only the probability $\beta_j(\tilde{t})$ that arm j is the best arm since
we do not explore at random during high reward periods. The usual ε-greedy algorithm is a special case when $G(t) = 1 \ \forall t$
and z > 1. Notice that $\varepsilon_t$ is a quantity $\Theta(1/\tilde{t})$, while $\beta_j(\tilde{t})$ is $o(1/\tilde{t})$, so that an asymptotic logarithmic bound in n holds for
$E[R_n]$ if $\tilde{t}$ grows at the same rate as t (because of the logarithmic bound on the harmonic series).

We want to compare this bound with the one of the usual version of the ε-greedy algorithm, but since the old version is
not well suited for the setting in which the rewards are altered by the multiplier function, we discount the rewards obtained at
each round (by simply dividing them by G(t)) so that it can also produce accurate estimates of the mean reward for each arm.
This “smarter” version of the ε-greedy algorithm is presented in Algorithm 7 (Section 4). The bound on the probability of
playing a suboptimal arm j for the usual ε-greedy algorithm is given by $\beta_j(t)$ (i.e., $\beta_j(\tilde{t})$ when $\tilde{t} = t$) and we refer to it as
$\beta_j^{\text{old}}(t)$. In general, $\beta_j^{\text{old}}(t)$ is lower than $\beta_j(\tilde{t})$ (since $\tilde{t} \le t$). Intuitively, this reflects the fact that the new algorithm performs
fewer exploration steps. Moreover, in the usual ε-greedy algorithm, the probability of choosing arm j at time t is given by


P({7,okl=7})  =e,I  +  (l-£t)l9?kl(0. 


which  is  less  than  the  probability  of  the  new  algorithm  in  case  of  low  G(l) 


P({7T=i})  =  *4 +  (!-*.)&(«), 

but  can  easily  be  liigher  than  the  probability  of  the  new  algorithm  in  case  of  high  rewards  (which  is  given  by  only  In 
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fact. 


*(«"-*}) -P(tt"-i}>  =  «*£  +  <! -e,)0?(t)-0j(l) 

if  t  >  km  we  get 

<■» 

if  t  <  km  we  get 

(H) 

and for $t$ large enough both expressions are positive, since $\beta_j(\tilde t)$ is $o(1/\tilde t)$ and we assume that $\tilde t$ is $\Theta(t)$. Having (13) and
(14) positive means that if we are in a high-reward period the probability of choosing a suboptimal arm decreases faster in
Algorithm 2. In that case, Algorithm 2 would have lower regret than the ε-greedy algorithm.

In practice, the threshold $z$ should be defined as $\arg\min(E[R_n])$. If this is too computationally challenging, but past data
are available, a good value for $z$ can be chosen using cross-validation techniques, i.e., by trying different thresholds with the
available data and choosing the one that yields the best performance.

The following corollary illustrates the benefits of the bound in a simple scenario in which the multiplier function can take only
two values and the regulating threshold separates the higher value from the lower one.


Corollary 3.1. Suppose the greed function $G(t)$ takes only two values: $g_{low}$ and $g_{high}$. At each turn $t$ it takes the value
$g_{low}$ for a fraction $q$ of the turns played, and the value $g_{high}$ for the remaining $t - qt$ turns (for example, if $q = 1/2$,
$G(t)$ alternates at each turn between $g_{low}$ and $g_{high}$). Then, the bound on the expected regret at turn $n$ reduces to:

$$E[R_n] \le O(1) + \Delta_{max}\, g_{low} \sum_{t=m+1}^{n} 1_{\{G(t) = g_{low}\}} \left(\Theta\!\left(\frac{1}{t}\right) + o\!\left(\frac{1}{t}\right)\right) + \Delta_{max}\, g_{high} \sum_{t=m+1}^{n} 1_{\{G(t) = g_{high}\}}\, o\!\left(\frac{1}{t}\right),$$

where $\Delta_{max} = \max_j \Delta_j$.

The term that hurts regret the most ($1/t$) is multiplied only by $g_{low}$, and not by $g_{high}$. When the rewards are high (and so is
the possible regret), only terms of order $o(1/t)$ are present. If exploration were permitted during the high-reward zone, there
would have been large terms of order $g_{high}/t$, which is what the algorithm is designed to avoid.


3.4 Soft ε-greedy algorithm

We present in Algorithm 3 a "soft version" of the ε-greedy algorithm where greed is regulated gradually (in contrast with the
hard threshold of the previous section). Again, in high-reward zones exploitation will be preferred, while in low-reward zones
the algorithm will explore the arms more. Let us define the following function

$$\psi(t) = \frac{\log\left(1 + \frac{1}{G(t)}\right)}{\log\left(1 + \frac{1}{\min_{s \in \{m+1,\dots,n\}} G(s)}\right)} \qquad (15)$$

and let $\gamma = \min_{s \in \{m+1,\dots,n\}} \psi(s)$. Notice that $0 < \psi(t) \le 1\ \forall t$ and that its values are close to 0 when $G(t)$ is high, while
they are close to 1 for low values of $G(t)$. The new probabilities of exploration during the game are given at each turn $t$ by
$\varepsilon_t = \min\{\psi(t), km/t\}$. In this way, we still maintain the linear decay of the probabilities of exploration, but we push them to
zero to avoid high regrets when the multiplier function $G(t)$ is high. We generally assume that $\min_{s \in \{m+1,\dots,n\}} G(s)$ is not
smaller than 1. The usual case is recovered when $G(t) = 1$ for all $t$.

The following theorem (proved in Appendix B) shows that a logarithmic bound holds in this case too (because the $\varepsilon_t$ are
$O(1/t)$ and their sum over time is logarithmically bounded, while the $\beta_j^S(t)$ term is $o(1/t)$).
Algorithm 3: Soft ε-greedy algorithm

Input: number of rounds $n$, number of arms $m$, a constant $k > 10$ such that $k > \frac{4}{\min_j \Delta_j^2}$, sequences
    $\{\varepsilon_t\}_{t=1}^{n} = \min\{\psi(t), km/t\}$ and $\{G(t)\}_{t=1}^{n}$
Initialization: play all arms once and initialize $\bar{X}_j$ (defined in (1)) for each $j = 1, \dots, m$
for $t = m + 1$ to $n$ do
    With probability $\varepsilon_t$ play an arm uniformly at random (each arm has probability $1/m$ of being selected),
    otherwise (with probability $1 - \varepsilon_t$) play arm $j$ such that
        $\bar{X}_j \ge \bar{X}_i\ \forall i$;
    Get reward $G(t) X_j$;
    Update $\bar{X}_j$;
end
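
The loop of Algorithm 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian reward model with `sigma`, the seed, and the use of a plain average of un-multiplied rewards as the estimate $\bar{X}_j$ are all choices made here for the example. It also assumes $n > m$ and $G(t) > 0$.

```python
import math
import random


def psi(G, t, g_min):
    """Soft exploration cap: log(1 + 1/G(t)) / log(1 + 1/min_s G(s))."""
    return math.log(1 + 1 / G(t)) / math.log(1 + 1 / g_min)


def soft_epsilon_greedy(n, means, G, k, sigma=0.05, seed=0):
    """Play n rounds; returns (total multiplied reward, pulls per arm).

    `means`, `sigma` and `seed` describe a hypothetical Gaussian bandit used
    only to make the sketch runnable.
    """
    rng = random.Random(seed)
    m = len(means)
    g_min = min(G(s) for s in range(m + 1, n + 1))
    sums = [0.0] * m   # running sums of the un-multiplied rewards X_j
    pulls = [0] * m
    total = 0.0
    for t in range(1, n + 1):
        if t <= m:
            j = t - 1  # initialization: play every arm once
        else:
            eps = min(psi(G, t, g_min), k * m / t)
            if rng.random() < eps:
                j = rng.randrange(m)  # explore uniformly at random
            else:
                j = max(range(m), key=lambda a: sums[a] / pulls[a])  # exploit
        x = rng.gauss(means[j], sigma)
        total += G(t) * x  # the reward actually received is G(t) * X_j
        sums[j] += x
        pulls[j] += 1
    return total, pulls
```

With a wave-shaped multiplier such as `lambda t: 21 + 20 * math.sin(0.25 * t)`, exploration concentrates in the troughs of $G(t)$, because `psi` is close to 1 there and close to 0 at the peaks.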


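Theorem 3.4 (Regret-bound for soft ε-greedy algorithm). The bound on the mean regret $E[R_n]$ at time $n$ is given by

$$E[R_n] \le \sum_{j=1}^{m} G(j)\,\Delta_j \qquad (16)$$

$$+ \sum_{t=m+1}^{n} G(t) \sum_{j:\mu_j<\mu^*} \Delta_j \left(\frac{\varepsilon_t}{m} + (1 - \varepsilon_t)\,\beta_j^S(t)\right), \qquad (17)$$

where

$$\beta_j^S(t) = k \left(\frac{\gamma t}{mke}\right)^{-k/10} \log\left(\frac{\gamma t}{mke}\right) + \frac{4}{\Delta_j^2} \left(\frac{\gamma t}{mke}\right)^{-k\Delta_j^2/4}. \qquad (18)$$

Theorem 3.5 (Regret-bound for soft ε-greedy algorithm). The bound on the mean regret $E[R_n]$ at time $n$ is given by

$$E[R_n] \le O(1) + \sum_{t=m+1}^{n} G(t) \left(\Theta\!\left(\frac{1}{t}\right) + o\!\left(\frac{1}{t}\right)\right). \qquad (19)$$

Intuitively, when $G(t)$ is low it is multiplied by a $\Theta(1/t)$ quantity, while when $G(t)$ is high it is multiplied by an $o(1/t)$
quantity.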


The sum in (16) is the exact mean regret during the initialization of Algorithm 3. For the rounds after the initialization
phase, the quantity in the parentheses of (17) is the bound on the probability of playing arm $j$ (where $\beta_j^S(t)$ is the bound on the
probability that arm $j$ is the best arm at round $t$, and $1/m$ is the probability of choosing arm $j$ when the choice is made at
random).

As before, we want to compare this bound with the "smarter" version of the ε-greedy algorithm presented in Algorithm
7. In the usual ε-greedy algorithm, after the "critical time" $n' = km$, the probability $P(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)})$ of arm
$j$ being the current best arm can be bounded by a quantity that is $o(1/t)$ as $t$ grows. Before time $n'$, the decay of
$P(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)})$ is faster and the bound is a quantity that is $o(1/t^\lambda)$, $\forall \lambda$ as $t$ grows (see Remark 1 in Appendix A).
The probability of choosing a suboptimal arm $j$ changes as follows:

• if $t < n'$, $P(\{I_t = j\}) = \frac{1}{m}$;

• if $t > n'$, $P(\{I_t = j\}) = \frac{k}{t} + \left(1 - \frac{km}{t}\right)\beta_j^{old}(t)$, which is $\Theta(1/t)$ as $t$ grows.

In the soft ε-greedy algorithm, before time $w$, defined as $w = \min\{s : f(s) < \gamma\}$ where $f(s) = km/s$, we have
that $\beta_j^S(t)$, which is the bound on the probability $P(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)})$ of arm $j$ being the current best arm, is a quantity
that is $o(1/(\gamma t)^\lambda)$, $\forall \lambda$ as $t$ grows (the argument is similar to Remark 1 in Appendix A). After $w$, it can be bounded by a
quantity that is $o(1/(\gamma t))$ as $t$ grows. The probability of choosing a suboptimal arm $j$ changes as follows:

• if $t < n'$, $P(\{I_t = j\}) = \frac{\psi(t)}{m} + (1 - \psi(t))\,\beta_j^S(t)$;

• if $n' < t < w$, $P(\{I_t = j\}) = \frac{1}{m}\min\{\psi(t), km/t\} + \left(1 - \min\{\psi(t), km/t\}\right)\beta_j^S(t)$;

• if $t > w$, $P(\{I_t = j\}) = \frac{k}{t} + \left(1 - \frac{km}{t}\right)\beta_j^S(t)$.

In order to interpret these quantities, let us see what happens for high or low values of the multiplier $G(t)$ as $t$ grows in Table
1. For brevity, we abuse notation when using Landau's symbols, because in some cases $t$ is not allowed to go to infinity; it
is convenient to still use the "little o" notation to compare the decay rates of the probabilities of choosing a suboptimal arm,
which also gives a qualitative explanation of what happens when using the algorithms. For the soft ε-greedy algorithm, the rate at
which the probability of choosing a suboptimal arm decays is faster when $G(t)$ is high, and worse when $G(t)$ is low. Notice
that the parameter $\gamma$ slows down the decay with respect to the usual ε-greedy algorithm. This is a direct consequence of the
slower exploration. An example of a typical behavior of $\psi(t)$ and $\varepsilon_t^{old}$ is shown in Figure 2, where $G(t) = 20 + 19\sin(t/2)$.

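Table 1: Summary of the decay rate of the probabilities of choosing a suboptimal arm for the soft ε-greedy algorithm and the
usual ε-greedy algorithm (supposing it takes the time patterns into account). The decay depends on the time regions of the
game presented in Figure 2.

Region | Round | $G(t)$ | $P(\{I_t = j\})^{old}$ | $P(\{I_t = j\})^{soft}$ | $P(\{I_t = j\})^{soft} < P(\{I_t = j\})^{old}$?
1 | $t < n'$ | high | $1/m$ | … | yes, much better
1 | $t < n'$ | low | $1/m$ | close to $1/m$ | no, but not by much
2 | $n' < t < w$ | high | $\Theta(1/t) + o(1/t)$ | $o(1/(\gamma t)^\lambda)$, $\forall \lambda$ | yes, much better
2 | $n' < t < w$ | low | $\Theta(1/t) + o(1/t)$ | $\Theta(1/t) + \dots$ | yes, but not by much
3 | $t > w$ | high | $\Theta(1/t) + o(1/t)$ | $\Theta(1/t) + o(1/(\gamma t))$ | no, but not by much
3 | $t > w$ | low | $\Theta(1/t) + o(1/t)$ | $\Theta(1/t) + o(1/(\gamma t))$ | no, but not by much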

Figure 2: Comparison of probabilities of exploration over the number of rounds. Before $n'$, $\varepsilon_t^{old}$ is 1 and always greater than
$\psi(t)$. After $w$, $\varepsilon_t^{old}$ is always less than $\psi(t)$.

3.5 Regulating greed in the UCB algorithm

Following what has been presented to improve the ε-greedy algorithm in this setting, we introduce in Algorithm 4 a modification
of the UCB algorithm. We again set a threshold $z$ and, if the multiplier of the rewards $G(t)$ is above this level, the new
algorithm exploits the best arm. When $G(t)$ is under the threshold, the algorithm plays the arm with the highest
upper confidence bound on the mean estimate.


Algorithm 4: UCB algorithm with regulating threshold
Input: number of rounds $n$, number of arms $m$, threshold $z$, sequence $\{G(t)\}_{t=1}^{n}$
Initialization: play all arms once and initialize $\bar{X}_j$ (as defined in (1)) for each $j = 1, \dots, m$
for $t = m + 1$ to $n$ do
    if ($G(t) < z$) then
        play arm $j$ with the highest upper confidence bound on the mean estimate
            $\bar{X}_{j,T_j(t-1)} + \sqrt{\frac{2 \log t}{T_j(t-1)}}$;
    else
        play arm $j$ such that
            $\bar{X}_j \ge \bar{X}_i\ \forall i$;
    end
    Get reward $G(t) X_j$;
    Update $\bar{X}_j$;
end
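
As a rough illustration of Algorithm 4, the sketch below switches between the UCB index and pure exploitation depending on whether $G(t)$ is below the threshold $z$. The function name, the Gaussian reward model, `sigma` and `seed` are assumptions introduced only for this example, and $\bar{X}_j$ is taken to be the plain average of un-multiplied rewards.

```python
import math
import random


def regulated_ucb(n, means, G, z, sigma=0.05, seed=0):
    """UCB with a hard greed threshold; returns (total reward, pulls per arm).

    The bandit (`means`, `sigma`, `seed`) is hypothetical and exists only to
    make the sketch runnable.
    """
    rng = random.Random(seed)
    m = len(means)
    sums = [0.0] * m   # running sums of the un-multiplied rewards X_j
    pulls = [0] * m
    total = 0.0
    for t in range(1, n + 1):
        if t <= m:
            j = t - 1  # initialization: play every arm once
        elif G(t) < z:
            # Low-reward round: play the arm with the highest UCB index.
            j = max(
                range(m),
                key=lambda a: sums[a] / pulls[a]
                + math.sqrt(2 * math.log(t) / pulls[a]),
            )
        else:
            # High-reward round: exploit the best mean estimate only.
            j = max(range(m), key=lambda a: sums[a] / pulls[a])
        x = rng.gauss(means[j], sigma)
        total += G(t) * x
        sums[j] += x
        pulls[j] += 1
    return total, pulls
```

Setting `z` above the maximum of $G(t)$ recovers the usual UCB rule on every round, while setting it below the minimum gives a purely greedy player after initialization.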


It is possible to prove that also in this case the regret can be bounded logarithmically in $n$. Let $B = \{t : G(t-1) <
z,\ G(t) \ge z\}$ be the set of rounds where the high-reward zone is entered, and let $r_t$ be the last round of the high-reward
zone that was entered at time $t$. Let us call $y_1, y_2, \dots, y_{|B|}$ the elements of $B$ and order them in increasing order such that
$y_1 < y_2 < \dots < y_{|B|}$. Let us also define for every $k \in \{1, \dots, |B|\}$ the set $Y_k = \{t : t \ge y_k,\ G(t) \ge z,\ t < y_{k+1}\}$ (where
$y_{|B|+1} = n$) of times in the high-reward period entered at time $y_k$, and let $\Lambda_k = \max_{t \in Y_k} G(t)$ be the highest value of $G(t)$ on $Y_k$.
Finally, for every $k$ let $R_k = \Lambda_k |Y_k|$.

Now, given a game of $n$ total rounds, we can "collapse" the $k$th high-reward zone into the entering time $y_k$ by defining
$G(y_k) = R_k$, for all $k$. Now, the maximum regret over $B$ is given by $(\max_j \Delta_j) \sum_{k=1}^{|B|} R_k$. By eliminating the set $B$ from the
game, we have transformed the original game into a shorter one, with $\eta$ steps, where $G(t)$ is bounded by $z$ and the usual UCB
algorithm is played. When the size of set $B$ decreases with $n$ (is of order $O(1/t)$ after an arbitrary time), the total regret has a
logarithmic bound in $n$.
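
The collapsing construction can be checked on a small sequence. The sketch below groups consecutive rounds with $G(t) \ge z$ into zones and returns, for each zone, the entering time $y_k$, the size $|Y_k|$, the peak $\Lambda_k$ and $R_k = \Lambda_k |Y_k|$. The function name is hypothetical, and treating a sequence that already starts above the threshold as a zone entered at $t = 1$ is an assumption made here, since $G(0)$ is not defined in the text.

```python
def high_reward_zones(G_values, z):
    """G_values[t - 1] holds G(t) for t = 1..n.

    Returns a list of tuples (y_k, |Y_k|, Lambda_k, R_k), one per zone.
    """
    zones = []
    current = None  # [entering time, zone length, max G] of the open zone
    for t, g in enumerate(G_values, start=1):
        if g >= z:
            if current is None:
                current = [t, 0, g]  # a new high-reward zone is entered at t
            current[1] += 1
            current[2] = max(current[2], g)
        elif current is not None:
            zones.append(current)  # the zone ended on the previous round
            current = None
    if current is not None:
        zones.append(current)
    return [(y, size, lam, lam * size) for y, size, lam in zones]


# Example: two zones above z = 4, entered at t = 2 and t = 6.
zones = high_reward_zones([1, 5, 7, 1, 1, 6, 1], z=4)
max_gap = 0.5  # hypothetical value of max_j Delta_j
collapsed_bound = max_gap * sum(r for _, _, _, r in zones)
```

Here `collapsed_bound` corresponds to the term $(\max_j \Delta_j) \sum_k R_k$ for the toy sequence.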

The ε-greedy methods are more amenable to this type of analysis than UCB methods, because the proofs require bounds
on the probability of choosing the wrong arm at each turn. The UCB proofs instead require us to bound the expected number
of times the suboptimal arms are played, without regard to when those arms were chosen. We were able to avoid using the
maximum of the $G(t)$ values in the ε-greedy proofs, but this is unavoidable in the UCB proofs without leaving terms in
the bound that cannot be explicitly calculated or simplified (an alternate proof would use weaker Central Limit Theorem
arguments).
Theorem 3.6 (Regret-bound for the regulated UCB algorithm). The bound on the mean regret $E[R_n]$ at time $n$ is given
by

$$E[R_n] \le \sum_{j=1}^{m} G(j)\,\Delta_j \qquad (20)$$

$$+\; z \left[ 8 \sum_{j:\mu_j<\mu^*} \frac{\log \eta}{\Delta_j} + \left(1 + \frac{\pi^2}{3}\right) \sum_{j=1}^{m} \Delta_j \right] \qquad (21)$$

$$+\; \left(\max_j \Delta_j\right) \sum_{k=1}^{|B|} R_k. \qquad (22)$$

Theorem 3.7 (Regret-bound for the regulated UCB algorithm). The bound on the mean regret $E[R_n]$ at time $n$ is given
by

$$E[R_n] \le O(1) \qquad (23)$$

$$+\; z\,O(\log(\eta)) + O(1) \qquad (24)$$

$$+\; O(1). \qquad (25)$$

The first sum in (20) is the exact mean regret of the initialization phase of Algorithm 4, the third sum in (22) is the bound
on the regret from the high-reward zones that have been collapsed, and the second term in (21) is the bound on the regret for $\eta$
rounds when $G(t)$ is under the threshold $z$; it follows from the usual bound on the UCB algorithm (for $n$ rounds the UCB
algorithm has a mean regret bounded by $8 \sum_{j:\mu_j<\mu^*} \frac{\log n}{\Delta_j} + \left(1 + \frac{\pi^2}{3}\right) \sum_{j=1}^{m} \Delta_j$).

Again, the threshold $z$ should be defined as $\arg\min(E[R_n])$ or, if past data are available, $z$ can be chosen using cross-validation.


3.6  The  soft  UCB  algorithm 


In Algorithm 5, we now present a "soft version" of the UCB algorithm where greed is regulated gradually (in contrast with the
hard threshold of the previous section). Again, in high-reward zones, exploitation will be preferred, while in low-reward zones
the algorithm will explore the arms.

Let  us  define  the  following  function: 


(26) 


At each turn $t$ of the game, the algorithm plays the arm with the highest upper confidence bound on the mean estimate, but, with
the introduction of $\xi(t)$, the confidence interval around $\bar{X}_{j,T_j(t-1)}$ is built in such a way that, when $G(t)$ is high, it collapses
onto the estimate itself, forcing the player to choose the arm with the highest mean estimate (thus leading to a pure exploitation
policy). In contrast, when the multiplier $G(t)$ is low, the confidence interval around $\bar{X}_{j,T_j(t-1)}$ stretches out, making the player
more likely to explore arms with high uncertainty.

One of the main difficulties in formulating these bounds is to define a correct functional form for $\xi(t)$ so that it is
possible to obtain smoothness in the arm decision, reasonable Chernoff-Hoeffding inequality bounds while working out the
proof (see Appendix C), and a convergent series (the second summation in (28)).

Also in this case, it is possible to achieve a bound that grows logarithmically in $n$.


Theorem 3.8 (Regret-bound for soft-UCB algorithm). The bound on the mean regret $E[R_n]$ at time $n$ is given by


E[R„]  <  VflO)A; 


•f  max  <?(t) 

,r»> 


,  R  4 108  (*<  ™  *m) +SAf(1+ ,  i,  *(tr4(t  - 1  - mf) 


(27) 


(28) 
Algorithm 5: Soft UCB algorithm

Input: number of rounds $n$, number of arms $m$, sequence $\{G(t)\}_{t=1}^{n}$
Initialization: play all arms once and initialize $\bar{X}_j$ (as defined in (1)) for each $j = 1, \dots, m$
for $t = m + 1$ to $n$ do
    play arm $j$ with the highest upper confidence bound on the mean estimate:
        $\bar{X}_{j,T_j(t-1)} + \sqrt{\frac{2 \log \xi(t)}{T_j(t-1)}}$;
    Get reward $G(t) X_j$;
    Update $\bar{X}_j$;
end


Theorem 3.9 (Regret-bound for soft-UCB algorithm). The bound on the mean regret $E[R_n]$ at time $n$ is given by

$$E[R_n] \le O(1) \qquad (29)$$

$$+ \max_{t \in \{m+1,\dots,n\}} G(t) \left[ O\!\left(\log\left(\max_{t \in \{m+1,\dots,n\}} \xi(t)\right)\right) + O(1) \right]. \qquad (30)$$


The first sum in (27) is the exact mean regret of the initialization phase of Algorithm 5. For the rounds after the initialization
phase, the mean regret is bounded by the quantity in (28), which is almost identical to the bound of the usual UCB algorithm if
we assume $G(t) = 1$ (i.e., rewards are not modified by the multiplier function).


4  Experimental  results 

We consider three types of multiplier function $G(t)$:

• The Wave Greed (Figure 3a): in this case customers come in waves: $G(t) = 21 + 20\sin(0.25t)$. We want to exploit the
best arm found so far during the peaks, and explore the other arms during low-reward periods;

• The Christmas Greed (Figure 3b): again, $G(t) = 21 + 20\sin(0.25t)$, but when $t \in \{650, 651, \dots, 670\}$, $G(t) = 1000$,
which shows that there is a peak in the rewards offered by the game (which we call "Christmas", in analogy to the
phenomenon of the boom of customers during the Christmas holidays);

• The Step Greed (Figure 3c): this case is similar to the Wave Greed case, but this time the function is not smooth:
$G(t) = 200$, but for $t \in \{600, 601, \dots, 800\} \cup \{1000, 1001, \dots, 1200\} \cup \{1400, 1401, \dots, 1600\}$ we have $G(t) = 400$.
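
For readers who want to reproduce the shapes of these three multiplier functions, they can be written directly from the definitions above. The function names are chosen here for illustration only.

```python
import math


def wave_greed(t):
    """Wave Greed: G(t) = 21 + 20 sin(0.25 t)."""
    return 21 + 20 * math.sin(0.25 * t)


def christmas_greed(t):
    """Christmas Greed: the wave, with a peak of 1000 for t in 650..670."""
    return 1000 if 650 <= t <= 670 else wave_greed(t)


def step_greed(t):
    """Step Greed: 200 by default, 400 on the three listed intervals."""
    high = (600 <= t <= 800) or (1000 <= t <= 1200) or (1400 <= t <= 1600)
    return 400 if high else 200
```

Note that the wave stays between 1 and 41, so the assumption that the minimum of $G(t)$ is not smaller than 1 holds for all three functions.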

We consider a game with 500 arms and normally distributed rewards. Each arm $j \in \{1, \dots, 500\}$ has mean reward
$\mu_j = 0.1 + (200 + 1.5(500 - j + 1))/(1.5 \times 500)$ and common standard deviation $\sigma = 0.05$. The arms were chosen in this
way so that $X_j$ would take values (with high probability) in [0, 1]. Having a bounded support for $X_j$ is a standard assumption
made when proving regret bounds (see Auer et al. [2002]). We play 2000 rounds in each game. After 2000 rounds the algorithms
have all essentially determined which arm is the best and tend to perform very similarly from that point onwards.

The well-known UCB and ε-greedy algorithms are not suitable for the setting in which the rewards are altered by the
multiplier function. Thus, in their current form, we cannot compare directly with them. The fact that rewards are multiplied
would irremediably bias all the estimates of the mean rewards, leading UCB and ε-greedy to choose arms that look good
just because they happened to be played in a high-reward period. For example, suppose we show an ad on a website at lunch
time: many people will see it because at that time web surfing is at its peak (i.e., the $G(t)$ multiplier is high). So even if
the ad was bad, we may register more clicks than for a good ad shown at 3:00 AM (i.e., when the $G(t)$ multiplier is low). To obtain a
fair comparison, we created "smarter" versions of the UCB and ε-greedy algorithms in which the rewards are discounted at
each round (by simply dividing them by $G(t)$), so that the old versions of the algorithms can also
produce accurate estimates of the mean reward for each arm. The smarter version of the usual UCB algorithm is presented in
Algorithm 6 and the one for the ε-greedy algorithm is shown in Algorithm 7. For the three multiplier functions, we report the
performance of the algorithms in Figures 4, 5, and 6.
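
The bias described in the lunch-time example can be seen with a tiny deterministic calculation. In the sketch below, the function name and the two toy observation lists are made up for illustration; each observation is a pair of the reward actually received, $G(t) X_j$, and the multiplier $G(t)$ at that round.

```python
def mean_estimate(observations, normalize):
    """Average reward of one arm from (received_reward, G_t) pairs.

    With normalize=True each received reward is divided by G(t) first,
    which is the discounting used by the "smarter" baselines.
    """
    values = [r / g if normalize else r for r, g in observations]
    return sum(values) / len(values)


# A weak ad (mean 0.2) always shown when G(t) = 10, and a strong ad
# (mean 0.5) always shown when G(t) = 1. Rewards are noise-free here.
bad_ad = [(0.2 * 10, 10)] * 3
good_ad = [(0.5 * 1, 1)] * 3

raw_ranking_prefers_bad = mean_estimate(bad_ad, False) > mean_estimate(good_ad, False)
normalized_prefers_good = mean_estimate(good_ad, True) > mean_estimate(bad_ad, True)
```

Without normalization the weak ad looks four times better than the strong one; after dividing by $G(t)$ the ordering is restored.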
Figure 3: Shapes of the multiplier functions used in the experiments. (a) The Wave Greed. (b) The Christmas Greed. (c) The Step Greed.

In Figures 7, 8, and 9, we change the rewards to have a Bernoulli distribution (the assumption of bounded support is
verified). Similarly to the normal case, each arm $j \in \{1, \dots, 500\}$ has probability of success $p_j = 0.1 + (200 + 1.5(500 -
j + 1))/(1.5 \times 500)$. One of the advantages of the ε-greedy algorithm is that there are no assumptions on the distribution of
the rewards, while UCB needs bounded support ([0, 1] for convenience, so that it is easier to use Hoeffding's inequality).

Figure 4: Comparison for the Wave Greed case.

Figure 5: Comparison for the Christmas Greed case. (a) ε-greedy algorithms rewards. (b) UCB algorithms rewards. (c) Final rewards comparison.
Algorithm 6: Smarter version of the usual UCB algorithm

Input: number of rounds $n$, number of arms $m$, sequence $\{G(t)\}_{t=1}^{n}$
Initialization: play all arms once and initialize $\bar{X}_j$ (as defined in (1)) for each $j = 1, \dots, m$
for $t = m + 1$ to $n$ do
    play arm $j$ with the highest upper confidence bound on the mean estimate:
        $\bar{X}_{j,T_j(t-1)} + \sqrt{\frac{2 \log(t)}{T_j(t-1)}}$;
    Get reward $G(t) X_j$;
    Update $\bar{X}_j$;
end


Algorithm 7: Smarter version of the usual ε-greedy algorithm

Input: number of rounds $n$, number of arms $m$, a constant $c > 10$, a constant $d$ such that $d < \min_j \Delta_j$ and
    $0 < d < 1$, sequences $\{\varepsilon_t\}_{t=1}^{n} = \min\{1, \frac{cm}{d^2 t}\}$ and $\{G(t)\}_{t=1}^{n}$
Initialization: play all arms once and initialize $\bar{X}_j$ (as defined in (1)) for each $j = 1, \dots, m$
for $t = m + 1$ to $n$ do
    with probability $\varepsilon_t$ play an arm uniformly at random (each arm has probability $1/m$ of being selected),
    otherwise (with probability $1 - \varepsilon_t$) play arm $j$ such that
        $\bar{X}_j \ge \bar{X}_i\ \forall i$;
    Get reward $G(t) X_j$;
    Update $\bar{X}_j$;
end
Figure 6: Comparison for the Step Greed case. (a) ε-greedy algorithms rewards.

Figure 7: Comparison for the Wave Greed case. (a) ε-greedy algorithms rewards. (b) UCB algorithms rewards. (c) Final rewards comparison.

Figure 8: Comparison for the Christmas Greed case. (a) ε-greedy algorithms rewards.


4.1  Discussion  on  Yahoo!  contest 

The motivation of this work comes from a high-scoring entry in the Exploration and Exploitation 3 contest, where the goal
was to build a better recommender system for Yahoo! Front Page news article recommendations. The contest data, which
was from Yahoo! and allows for unbiased evaluations, is described by Li et al. [2010]. These data had several challenging
characteristics, including broad trends over time in click-through rate, arms (news articles) appearing and disappearing over
time, the inability to access the data in order to cross-validate, and other complexities. This paper does not aim to handle all
of these, but only the one which led to a key insight in increased performance, which is the regulation of greed over time.
Although there were features available for each time, none of the contestants were able to successfully use the features to
substantially boost performance, and the exploration/exploitation aspects turned out to be more important. Here are the main
insights leading to large performance gains, all involving regulating greed over time:

• "Peak grabber": Stop exploration when a good arm appears. Specifically, when the article was clicked 9/100 times, keep
showing it and stop exploration altogether until the arm's click-through rate drops below that of another arm. Since this
strategy does not handle the massive global trends we observed in the data, it needed to be modified as follows:

• "Dynamic peak grabber": Stop exploration when the click-through rate of one arm is at least 15% above the
global click-through rate.

• Stop exploring old articles: We can determine approximately how long the arm is likely to stay, and we reduce exploration
gradually as the arm gets older.

• Do not fully explore new arms: When a new arm appears, do not use 1 as the upper confidence bound for the probability
of click, which would force a UCB algorithm to explore it; use 0.88 instead. This allows the algorithm to continue
exploiting the arms that are known to be good rather than exploring new ones.

Figure 9: Comparison for the Step Greed case. (a) ε-greedy algorithms rewards. (b) UCB algorithms rewards. (c) Final rewards comparison.
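
The dynamic peak grabber rule can be expressed as a one-line test. The sketch below is only one possible reading of the rule: it interprets "at least 15% above" as a relative margin over the global click-through rate, which is an assumption, and the function and parameter names are hypothetical rather than taken from the contest entry.

```python
def stop_exploration(arm_ctrs, global_ctr, margin=0.15):
    """Return True if some arm's click-through rate is at least `margin`
    (relative, an assumed interpretation) above the global rate."""
    return any(ctr >= (1 + margin) * global_ctr for ctr in arm_ctrs)


# With a global click-through rate of 4%, the cut-off is 4.6%.
peak_found = stop_exploration([0.03, 0.05], global_ctr=0.04)
no_peak = stop_exploration([0.03, 0.045], global_ctr=0.04)
```

Because the cut-off moves with the global rate, the rule adapts to the broad trends in click-through rate that a fixed cut-off such as 9/100 cannot follow.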

The peak grabber strategies inspired the abstracted setting here, where one can think of a good article appearing during periods
of high $G(t)$, where we would want to limit exploration; however, the other strategies are also relevant cases where the
exploration/exploitation tradeoff is regulated over time. There were no "lock-up" periods in the contest dataset, though as
discussed earlier, the $G(t)$ function is also relevant for modeling that setting. The large global trends we observed in the contest
data click-through rates are very relevant to the $G(t)$ model, since one would want to explore less when the click rate
is high in order to get more clicks overall.

5  Conclusions 

The  dynamic  trends  we  observe  in  most  retail  and  marketing  settings  are  dramatic.  It  is  possible  that  understanding  these 
dynamics  and  how  to  take  advantage  of  them  is  central  to  the  success  of  multi-armed  bandit  algorithms.  We  showed  in  this 
work  how  to  adapt  regret  bound  analysis  to  this  setting,  where  we  now  need  to  consider  not  only  how  many  times  an  arm 
was  pulled  in  the  past,  but  precisely  when  the  arm  was  pulled.  The  key  element  of  our  algorithms  is  that  they  regulate  greed 
(exploitation)  over  time,  where  during  high  reward  periods,  less  exploration  is  performed. 

There are many possible extensions to this work. In particular, if $G(t)$ is not known in advance, it may be easy to estimate
from data in real time, as in the dynamic peak grabber strategy. The analysis of the algorithms in this paper could be extended to
other important multi-armed bandit algorithms besides ε-greedy and UCB. Further, future work will consider the connection of
mortal bandits (with appearing/disappearing arms) with the $G(t)$ setting, since for mortal bandits, each bandit's $G(t)$ function
can change at a different rate.
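A Regret-bound for ε-greedy algorithm with hard threshold

The regret at round $n$ is given by

$$R_n = \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j\, G(t)\, 1_{\{I_t = j\}}, \qquad (31)$$

where $G(t)$ is the greed function evaluated at time $t$, $1_{\{I_t = j\}}$ is an indicator function equal to 1 if arm $j$ is played at time $t$
(otherwise its value is 0) and $\Delta_j = \mu^* - \mu_j$ is the difference between the mean of the best arm's reward distribution and the
mean of the $j$-th arm's reward distribution. By considering the threshold $z$, which determines which rule is applied to decide what
arm to play, we can rewrite the regret as

$$R_n = \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j\, G(t)\, 1_{\{G(t) < z\}}\, 1_{\{I_t = j\}} + \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j\, G(t)\, 1_{\{G(t) \ge z\}}\, 1_{\{I_t = j\}}.$$

By taking the expectation we have that

$$E[R_n] = \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j\, G(t)\, 1_{\{G(t) < z\}}\, P(\{I_t = j\}) + \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j\, G(t)\, 1_{\{G(t) \ge z\}}\, P(\{I_t = j\}),$$

which can be rewritten as

$$E[R_n] = \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j\, G(t)\, 1_{\{G(t) < z\}} \left[\frac{\varepsilon_{\tilde t}}{m} + (1 - \varepsilon_{\tilde t})\, P\!\left(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)}\ \forall i\right)\right]$$
$$+ \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j\, G(t)\, 1_{\{G(t) \ge z\}}\, P\!\left(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)}\ \forall i\right). \qquad (32)$$

For the rounds of the algorithm where $G(t) < z$, we are in the standard setting, so for those times we follow the standard
proof of Auer et al. [2002]. For the times that are over the threshold, we need to create a separate bound. Let us now bound the
probability of playing the sub-optimal arm $j$ at time $t$ when the greed function is above the threshold $z$.

$$P\!\left(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)}\ \forall i\right) \le P\!\left(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\right) \qquad (33)$$
$$\le P\!\left(\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right) + P\!\left(\bar{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\right), \qquad (34)$$

where the last inequality follows from the fact that

$$\left\{\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\right\} \subset \left(\left\{\bar{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\right\} \cup \left\{\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right\}\right). \qquad (35)$$

In fact, suppose that there exists an element $\omega \in \{\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\}$ that does not belong to
$\left(\{\bar{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\} \cup \{\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\}\right)$. Then, we would have that

$$\omega \in \left(\left\{\bar{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\right\} \cup \left\{\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right\}\right)^{C} \qquad (36)$$
$$= \left\{\bar{X}_{*,T_*(t-1)} > \mu^* - \frac{\Delta_j}{2}\right\} \cap \left\{\bar{X}_{j,T_j(t-1)} < \mu_j + \frac{\Delta_j}{2}\right\}, \qquad (37)$$

but from the intersection of events given in (37) it follows that $\bar{X}_{*,T_*(t-1)} > \mu^* - \frac{\Delta_j}{2} = \mu_j + \frac{\Delta_j}{2} > \bar{X}_{j,T_j(t-1)}$, which
contradicts $\omega \in \{\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\}$.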
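Therefore, all elements of $\{\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\}$ belong to $\left(\{\bar{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\} \cup \{\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\}\right)$.

Let us consider the first term of (34) (the computations for the second term are similar),

$$P\!\left(\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right) = \sum_{s=1}^{t-1} P\!\left(T_j(t-1) = s,\ \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right)$$
$$= \sum_{s=1}^{t-1} P\!\left(T_j(t-1) = s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) P\!\left(\bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right)$$
$$\le \sum_{s=1}^{t-1} P\!\left(T_j(t-1) = s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) e^{-\frac{\Delta_j^2}{2} s}, \qquad (38)$$

where in the last inequality we used the Chernoff-Hoeffding bound. Let us define $T_j^R(t-1)$ as the number of times arm
$j$ is played at random (note that $T_j^R(t-1) \le T_j(t-1)$ and that $T_j^R(t-1) = \sum_s B_s$, where $B_s$ is a Bernoulli r.v. with
parameter $\varepsilon_s/m$), and let us define

$$\lambda_{\tilde t} = \frac{1}{2m} \sum_{s=1}^{\tilde t} \varepsilon_s,$$

where $\tilde t$ is the number of rounds played under the threshold $z$ up to time $t$. Then,

$$(38) \le \sum_{s=1}^{\lfloor \lambda_{\tilde t} \rfloor} P\!\left(T_j(t-1) = s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) + \sum_{s=\lfloor \lambda_{\tilde t} \rfloor + 1}^{\infty} e^{-\frac{\Delta_j^2}{2} s}$$
$$\le \sum_{s=1}^{\lfloor \lambda_{\tilde t} \rfloor} P\!\left(T_j(t-1) = s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2} \lfloor \lambda_{\tilde t} \rfloor}$$
$$\le \sum_{s=1}^{\lfloor \lambda_{\tilde t} \rfloor} P\!\left(T_j^R(t-1) \le s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2} \lfloor \lambda_{\tilde t} \rfloor}$$
$$\le \lfloor \lambda_{\tilde t} \rfloor\, P\!\left(T_j^R(t-1) \le \lfloor \lambda_{\tilde t} \rfloor\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2} \lfloor \lambda_{\tilde t} \rfloor}, \qquad (39)$$

where for the first $\lfloor \lambda_{\tilde t} \rfloor$ terms of the sum we upper-bounded $e^{-\frac{\Delta_j^2}{2} s}$ by 1, and for the remaining terms we used the fact that
$\sum_{s=x+1}^{\infty} e^{-ks} \le \frac{1}{k} e^{-kx}$, where in our case $k = \frac{\Delta_j^2}{2}$. We have that

$$E[T_j^R(t-1)] = \frac{1}{m} \sum_{s=1}^{\tilde t} \varepsilon_s = 2\lambda_{\tilde t}, \qquad Var(T_j^R(t-1)) = \sum_{s=1}^{\tilde t} \frac{\varepsilon_s}{m}\left(1 - \frac{\varepsilon_s}{m}\right) \le \frac{1}{m} \sum_{s=1}^{\tilde t} \varepsilon_s = E[T_j^R(t-1)],$$

and, using the Bernstein inequality $P(S_n \le E[S_n] - a) \le \exp\left\{-\frac{a^2/2}{\sigma^2 + a/2}\right\}$ with $S_n = T_j^R(t-1)$ and $a = \frac{1}{2} E[T_j^R(t-1)]$,

$$P\!\left(T_j^R(t-1) \le \lfloor \lambda_{\tilde t} \rfloor\right) \le P\!\left(T_j^R(t-1) \le E[T_j^R(t-1)] - \frac{1}{2} E[T_j^R(t-1)]\right)$$
$$\le \exp\left\{-\frac{\frac{1}{8}\left(E[T_j^R(t-1)]\right)^2}{E[T_j^R(t-1)] + \frac{1}{4} E[T_j^R(t-1)]}\right\}$$
$$= \exp\left\{-\frac{1}{10} E[T_j^R(t-1)]\right\} \le \exp\left\{-\frac{1}{5} \lfloor \lambda_{\tilde t} \rfloor\right\}. \qquad (40)$$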
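
The constant $1/5$ in (40) follows from a short piece of arithmetic, spelled out here as a sketch. Write $E = E[T_j^R(t-1)] = 2\lambda_{\tilde t}$ (a shorthand introduced only for this display), take $a = E/2$, and use the variance bound $\sigma^2 \le E$ in the Bernstein exponent:

```latex
\frac{a^2/2}{\sigma^2 + a/2}
  \;\ge\; \frac{\tfrac{1}{8}E^2}{E + \tfrac{1}{4}E}
  \;=\; \frac{E}{10}
  \;=\; \frac{\lambda_{\tilde t}}{5}
  \;\ge\; \frac{\lfloor \lambda_{\tilde t} \rfloor}{5}.
```

A larger exponent gives a smaller probability bound, so replacing $\sigma^2$ by its upper bound $E$ and $\lambda_{\tilde t}$ by its floor only loosens the inequality.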
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Now  we  need  a  lower  bound  on  |A*J.  I-et  us  define  n  =  [km\,  then 

1 


At  = 


2  m 


£*• 


«=1  v  J 

1  n  1  * 

1  v  i  '  v 

2m  2m 


km 

s 


=  S  +  Sll£ 


a=n'+l 

km 

«-i  5  *-l 


>  2^  +  2  (log(*  +  1)  -  0°g(n')  +  l°g(e)) 

>  ^log 
=  ^log 


(-1) 

+  ||og(4) 

\mk  J 

2  \n'e  J 

(— 

\ 

\  mke  . 

)■ 

(41) 


Remark  1.  Note  that  if  l  (or  t  in  the  usual  ■  -greedy  algorithm)  was  less  titan  n',  then  we  would  have  Xt  =  t/2m,  yielding  an 
exponential  decay  of  the  bound  on  the  probability  of  j  being  the  best  arm.  To  see  this,  l  <n'  would  imply  that,  using  (39 )  and 
(40), 


i  (  l  t  )  2  I 

Ay  i  \ 

2^exp|-^/+A?exp( 

2  2m  J 

Continuing  the  proof  of  Theorem  3.1,  we  obtain  a  bound  on  the  first  term  in  (34)  as  follows.  Using  (41)  combined  with 
(40)  in  (39),  we  get  that 

O.V*. 


2  \  mke  J 


mke  J  A?  \  mke  J 


(42) 


Since the computations for the second term in (34) are similar, a bound on $P(\bar X_{j,T_j(t-1)} \ge \bar X_{i,T_i(t-1)}\ \forall i)$ is given by


(43) 


We now have an upper bound for $P(\bar X_{j,T_j(t-1)} \ge \bar X_{i,T_i(t-1)}\ \forall i)$. We can use this to bound the corresponding probability in (32), which yields the following bound on the mean regret at time $n$:

m 

E  [flu]  <  £C(j)  Ay 

J-l 

+  jr  G(i) l{ow<t}  £  Ay  (e,2  +  (i  -£,)&(*)) 

e=m+l  ^  ' 

+  E  {G(t)>z)  E 

t-m+l  }■&,<(*• 

This,  combined  with  the  bound  above,  proves  the  theorem. 

B Logarithmic bound for Soft-$\varepsilon$-greedy algorithm

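At each round $t$, arm $j$ is played with probability

$$\frac{\varepsilon_t}{m} + (1-\varepsilon_t)\,P\big(\bar X_j \ge \bar X_i\ \forall i\big),$$

where $\varepsilon_t = \min\{\psi(t),\, km/t\}$ and

$$\psi(t) = \frac{\log\Big(1 + \dfrac{1}{G(t)}\Big)}{\log\Big(1 + \dfrac{1}{\min_{s\in\{m+1,\dots,n\}} G(s)}\Big)}.$$

Recall that $\gamma = \min_{1\le t\le n}\psi(t)$.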

Let us bound the probability $P(\{I_t = j\})$ of playing the sub-optimal arm $j$ at time $t$. We have that

P  ^  ■^4,Ti(*-l)  Vi)  <  P  (^,T,(t-l)  >  -?*,T.(t-l)) 

5  P  >  lb  +  ^  J  +  •"  i)  iMt  +  yj- 

For the bound on the two addends in (45), we have identical steps to the proof for Theorem 1, and thus

P  >  W  +  <  LA»JP(T/(«  -  1)  <  [AtJ)  + 

P  <  P.  +  <  LA.JP  (T/(<  -  1)  <  LA.J)  + 

and  we  again  have 

P(3f  (*-!)<  LA*J)<«p{-1lA.j}. 

Now we need a lower bound on $\lfloor\lambda_t\rfloor$. Let $t \ge w$ where $w = \min\{1, \dots, n\}$ such that $km/w < \gamma$. Then,


(44) 


(45) 


(46) 

(47) 


(48) 


At  = 


> 

> 

> 


a=2> 

a—  1 


1  1  /  ^  km  km 


^  +  2  (log^  +  !)  (los(w)  +  *°K(e)) 


5  *°s  (  r  1  +  los 


/  t  \ 

\we  J 


k . 
2l0g 


yt  \ 

mke  J  ’ 


(49) 


Using (49) in (48), combined with (46) and (47), from (45) the bound on $P(\bar X_{j,T_j(t-1)} \ge \bar X_{i,T_i(t-1)}\ \forall i)$ is given by


Since  the  mean  regret  is  given  by 

n  m 

e=l  j-1 

the  bound  on  the  mean  regret  at  time  n  is  given  by 


(50) 


E  [fin]  <  £C(flA, 

J-1 

n  m  /  1 

+  ^  Aj  (  £«—  +  (1  -  £t )0j(t) 
C The regret bound of the soft UCB algorithm

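The regret at round $n$ is given by

$$
\begin{aligned}
R_n &= \sum_{j=1}^{m} G(j)\Delta_j + \sum_{t=m+1}^{n}\sum_{j=1}^{m}\Delta_j\,G(t)\,\mathbb{1}_{\{I_t=j\}} \\
&\le \sum_{j=1}^{m} G(j)\Delta_j + \Big(\max_{t\in\{m+1,\dots,n\}} G(t)\Big)\sum_{j=1}^{m}\Delta_j\Big(\sum_{t=m+1}^{n}\mathbb{1}_{\{I_t=j\}}\Big).
\end{aligned}
$$

The expected regret $E[R_n]$ at round $n$ is bounded by

$$E[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j + \Big(\max_{t\in\{m+1,\dots,n\}} G(t)\Big)\sum_{j=1}^{m}\Delta_j\,E[T_j(n)], \qquad (51)$$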


where $T_j(n) = \sum_{t=1}^{n}\mathbb{1}_{\{I_t=j\}}$ is the number of times the sub-optimal arm $j$ has been chosen up to round $n$. Recall from (1) that

(52) 


*‘-Tzh)  § 


From the Chernoff-Hoeffding inequality we have that


and 


Tj(t 


_  V 

•») 


<  exp{-22)(f  -  1)£2}, 


/  i  T,0-»>  \ 

E  <<^{-27^-1)^}. 

Let  us  define  the  following  function: 

m  =(1  +  G(t))  ’ 

by  selecting  e  -  J we  have 


Equivalently,  we  may  write  for  every  j 


(53) 


(54) 


(55) 


Mi  -  J <  %)  with  probabihty  at  least  1  -  £(f)-4, 


lij  +  y  r-(t-^l)'  ~  with  probability  at  least  l-£(f)-4. 
If we choose arm $j$ at round $t$ (i.e., the event $\{I_t = j\}$ occurs) we have that


X<  + 


■2  log i(t) 

T.(t  1) 
Let us use (57) to upper bound the LHS and (56) to lower bound the RHS of (58), then we get with probability at least $1 - 2\xi(t)^{-4}$

$$\mu_j + 2\sqrt{\frac{2\log\xi(t)}{T_j(t-1)}} \ge \mu_*,$$

from which we get with probability at least $1 - 2\xi(t)^{-4}$

$$T_j(t-1) \le \frac{8}{\Delta_j^2}\log\xi(t). \qquad (59)$$

In order to emphasize the dependence of $\bar X_j$ on $T_j(t-1)$ we will sometimes write $\bar X_{j,T_j(t-1)}$. In the following, notice that in (61) the summation starts from $m+1$ because in the first $m$ initialization rounds each arm is played once. Moreover, step (62) follows from (61) by assuming that arm $j$ has already been played $u$ times. By using (58) we get (63), then, for each $t$,

/2iog m  .  ?  .  /2iog«o 

+ y t^i -T)  - + \/ s(  _  / 


^  +  ./2l0g^t)  >  min  ft.,,.  + 


Sj  5.€(l,.-.,T*(e-l» 


which  justifies  (64).  We  also  have  that  (60)  is  included  in 


ini r{^+^F2*-~+v^F}- 


Thus,  for  any  integer  u,  we  may  write 

r,(n)  -  i+  £  Mh-i) 

t=m+l 


“+  Y1  1{!t  -  l)>u) 

t—m+ 1 


=  “  +  j£+t 1  +  \/|^=T)- +  /l 12  “} 

^  i  V'  i  \  o  .  /21og£(t)  p  /21og£(t)  1 

<  «+  >  II  <  max  Xj,  4-4/ - —  >  min  A.  +  \  - —  > 

,y|t  K«{«,r3S<t-i)>  *■*>  y  **  . *«*-*))  V  *•  J 


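is equal to one, at least one of the following has to be true:

$$\bar X_* \le \mu_* - \sqrt{\frac{2\log\xi(t)}{T_*(t-1)}}, \qquad (67)$$

$$\bar X_j \ge \mu_j + \sqrt{\frac{2\log\xi(t)}{T_j(t-1)}}, \qquad (68)$$

$$\mu_* < \mu_j + 2\sqrt{\frac{2\log\xi(t)}{T_j(t-1)}}. \qquad (69)$$

(In fact, suppose that none of them holds. Then from (67) we would have that $\bar X_* > \mu_* - \sqrt{2\log\xi(t)/T_*(t-1)}$; then, by applying (69) (reversed, since we are assuming it does not hold) we get $\bar X_* > \mu_j + 2\sqrt{2\log\xi(t)/T_j(t-1)} - \sqrt{2\log\xi(t)/T_*(t-1)}$; then from (68) (again reversed) it follows that $\bar X_* > \bar X_j + \sqrt{2\log\xi(t)/T_j(t-1)} - \sqrt{2\log\xi(t)/T_*(t-1)}$, which is in contradiction with (66).) Now, if we set $u = \Big\lceil \frac{8}{\Delta_j^2}\log\big(\max_{t\in\{m+1,\dots,n\}}\xi(t)\big)\Big\rceil$, for $T_j(t-1) \ge u$,

$$
\begin{aligned}
\mu_* - \mu_j - 2\sqrt{\frac{2\log\xi(t)}{T_j(t-1)}} &\ge \mu_* - \mu_j - 2\sqrt{\frac{2\,\Delta_j^2\log\xi(t)}{8\log\big(\max_{t}\xi(t)\big)}} \\
&= \mu_* - \mu_j - \Delta_j\sqrt{\frac{\log\xi(t)}{\log\big(\max_{t}\xi(t)\big)}} \\
&\ge \mu_* - \mu_j - \Delta_j = 0,
\end{aligned}
$$

therefore, with this choice of $u$, (69) cannot hold.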
Thus,  using  (65),  we  have  that 


Tj(n)  < 


-f^logf  max  £( t )) 
A2  RV*«(m+l.  ■■■.»}  / 


n  T.(t-l) 

E  E  E  »{ s a. 

t=m+l  ».  =  1  *>«  u 

»  T*(t—  1)  T,(t-1) 

E  E  E  M 

t=m+l  a.=l  »,  -u  ( 


21og|(t)| 


and  by  taking  expectation, 

EP>(«)J  < 


wlogf  max  £(<) ) 

;  \*{m+l,  .n}^V7 


„  T.(t— 1)  T,(t-1) 

+  E  E  E 

t=m+l  s.  —  1  *u 


21ogf(e) 


+ 


i=m-f  1 
8 

A? 


„  T.(t-l)  T,(t-1)  f  /7"j  7-T'i 

E  E  E 

=m-fl  s.  —  1  ti  \  V  *  J 

3  log  (  max  £(«))+ 1  +  2  V  ?(i)'4(t  -  1  -  m)2. 
i  V<€{-+1.  .»}  / 


where in the last step we upper-bound $T_*(t-1)$ and $T_j(t-1)$ by $(t-1-m)$ (cases where we have only played the best arm or arm $j$). Therefore, by using (51)


E[fl„]  <  VG(J')A, 


j=  1 


+  max  G(t) 

t€{m+l,  ,»} 


E  e{JSf.,}«*))  +  EA> 

r-Vi<n*  3  v  7  j-i 


1+  ]T  2(«t))-4(t-l-m)2 

t-3-m  +  l 


D Regret bound proof for regulating greed with arm pool size

The regret at round $n$ is given by

$$R_n = \sum_{t=1}^{n}\sum_{j=1}^{m}\Delta_j\,G(t)\,\mathbb{1}_{\{I_t=j\}}, \qquad (70)$$

where $G(t)$ is the greed function evaluated at time $t$, $\mathbb{1}_{\{I_t=j\}}$ is an indicator function equal to 1 if arm $j$ is played at time $t$ (otherwise its value is 0) and $\Delta_j = \mu_* - \mu_j$ is the difference between the mean of the best arm reward distribution and the mean of the $j$-th arm reward distribution. Similarly, $\Delta_j^{*-m_t} = \mu_{*-m_t} - \mu_j$ is the difference between the mean reward of the $m_t$-th best arm and the mean reward of arm $j$ (with this definition we can also write $\Delta_j = \Delta_j^{*-1}$ and $\mu_* = \mu_{*-1}$). By taking the expectation of (70) we get


E[Kn]  =  V  — —  V  AjP  1)  >  %i,Tt(t- 1)  f°r  at  least  m  —  mt  indexes  tj 


(71) 


where  mt  is  the  size  of  the  pool  of  arms  at  time  t,  defined  by  mt  =  min  (m,  max  ^1,  j  j .  We  have  that 

P  (  Xj,T}(t- 1)  >  1)  for  at  least  m-mt  indexes  i ) 

<  F  >  ^•—m,  («— 1)) 

/  /  A*~m<  \ 

<  p  ( W»  >-»  +  ^r-)+ p  (*•«-■> « ■ - -v) 

the last inequality follows from the fact that


>  X>- 


m,,T._„,(  t- 


-„}<=({ 


A*-™* 

Xj,T,(c-l)  >  hi  4 - ^ — 


|  U  [X,T. 


(t-1)  <  ft' 


A  »  — m( 

ri _ 


(72) 


(73) 


In  fact,  suppose  that  the  inclusion  (73)  does  not  hold.  Then  we  would  have  that 

{■^J.r,(t-i)  >  %•- C  <(*•"”*' - — |  u  >  hi  +  C74) 

X.,T,(t-l)  >  h'~m’ - ^ |  n  |  XI.Tt(t-t)  <hl  +  — ^ 1  .  f75) 

A*  m* j  A*-mt 

but  from  the  intersection  of  events  given  in  (75)  it  follows  that  x,-mt,T.-m  (t-i)  >  /a*-”1'  -  — ^ —  >  fij  —  — —  > 
t_i)  which  contradicts  (74).  Let  us  consider  the  first  term  of  (72)  (the  computations  for  the  second  term  are  similar). 


p  xi.T,( t-i)  >  hi +  ■ 


A*" 


t-t  /  A’_m' 

£p(l}(t-l)  =  s, Xj,.  >  N  +  -L— 

t-t  /  I  a  ,~m' 

=  yp  {TJ(t-D  =  s \xj,.>N+-±— 


Xi.s  >hi  + 


Ar 


A*_m'\  .(A*-”" 

xj,t  —  hi  4  «  I e  » 


(76) 


where in the last inequality we used the Chernoff-Hoeffding bound. Let us define $T_j^R(t-1)$ as the number of times arm $j$ is played at random when the pool size is full before round $t$ starts, and let us define


At  — 
Then,


(38)  < 


< 


< 


< 


•  ~mt  \  t-1 

v- +  e 

/  »-lA,J  +  l 


1>«J  / 

£p  2}(i-l)=s|^,.>p,+ 

S=I  \ 

Ep  (  r#(*  -  !)-•!**.  >/*  +  -V-  I  +  *1* - * 

<-l 

IA.J 


-1>.J  + 

o  <*!*”*)’ 

2  -*  >  .  L-Vj 


AT” 


Af 

+  A 

+  A* 


K' "~'i 3 


LAtjP  (T/(t -  1)  <  A*)  +  (A..m,)2e-^ 


l*.J 


(77) 


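where for the first $\lfloor\lambda_t\rfloor$ terms of the sum we upper-bounded the exponential factor by 1, and for the remaining terms we used the fact that $\sum_{s=\lfloor\lambda_t\rfloor+1}^{\infty} e^{-ks} \le \frac{1}{k}e^{-k\lfloor\lambda_t\rfloor}$, where in our case $k = (\Delta_j^{*-m_t})^2/2$. Since $T_j^R(t-1)$ is a sum of $\sum_{s=1}^{t-1}\mathbb{1}_{\{m_s = m\}}$ independent Bernoulli r.v. with parameter $1/m_s$, we have that

$$E[T_j^R(t-1)] = \frac{1}{m}\sum_{s=1}^{t-1}\mathbb{1}_{\{m_s=m\}} = 2\lambda_t,$$

$$\mathrm{Var}\big(T_j^R(t-1)\big) = \frac{1}{m}\Big(1 - \frac{1}{m}\Big)\sum_{s=1}^{t-1}\mathbb{1}_{\{m_s=m\}} \le E[T_j^R(t-1)],$$

and, using the Bernstein inequality $P(S_n \le E[S_n] - a) \le \exp\Big\{-\frac{a^2/2}{\sigma^2 + a/2}\Big\}$ with $S_n = T_j^R(t-1)$ and $a = \frac{1}{2}E[T_j^R(t-1)]$,

$$
\begin{aligned}
P\big(T_j^R(t-1) \le \lambda_t\big) &= P\Big(T_j^R(t-1) \le E[T_j^R(t-1)] - \tfrac{1}{2}E[T_j^R(t-1)]\Big) \\
&\le \exp\left\{-\frac{\tfrac{1}{8}\big(E[T_j^R(t-1)]\big)^2}{E[T_j^R(t-1)] + \tfrac{1}{4}E[T_j^R(t-1)]}\right\} \\
&= \exp\Big\{-\tfrac{1}{10}E[T_j^R(t-1)]\Big\} = \exp\Big\{-\tfrac{1}{5}\lambda_t\Big\}. \qquad (78)
\end{aligned}
$$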

In order to bound (78) we need $\lambda_t \ge \gamma\log(t)$ with $\gamma > 5$ so that $P(T_j^R(t-1) \le \lambda_t) \le t^{-\gamma/5}$ (if $G(t)$ does not satisfy the requirement that $\lambda_t \ge \gamma\log(t)$ it is easy to construct $G'(t) = (c-1)/t$ for $t \in \{t+1, \dots, t+2m\}$ if $\lfloor\gamma\log(t)\rfloor > \lfloor\gamma\log(t-1)\rfloor$, otherwise $G'(t) = G(t)$.)

Then,  (77)  is  bounded  by 

W)  =  7log(«)(*)-^5  +  (A,_w,)2f - ^ -  (79) 

The  computations  for  the  second  term  in  (76)  are  similar,  therefore 

n  m  « 

E[*»]  <  £  E  ^G(t)— ■fif,  (80) 
E Notation summary


♦  m:  number  of  arms; 

♦  n:  number  of  rounds; 

♦ $G : \{1, \dots, n\} \to \mathbb{R}^+$: known multiplier function;

♦ $X_j(t)$: unscaled random reward for playing arm $j$;

♦ $X_j(t)G(t)$: actual reward;

♦ $\mu_*$: mean reward of the optimal arm ($\mu_* = \max_{1\le j\le m}\mu_j$);

♦ $\Delta_j$: difference between the mean reward of the optimal arm and the mean reward of arm $j$ ($\Delta_j = \mu_* - \mu_j$);

♦ $\bar X_j$: current estimate of $\mu_j$;

♦ $I_t$: arm played at turn $t$;

♦ $T_j(t-1)$: number of times arm $j$ has been played before round $t$ starts;

♦ $z$: threshold (used in Algorithm 2 and Algorithm 4);

♦ $\tilde t$: number of rounds under the threshold $z$ up to time $t$;

♦  k:  a  constant  greater  than  10  such  that  k  >  m|n4  A  in  Algorithm  2  and  Algorithm  3; 

♦  c:  a  constant  greater  than  10  in  Algorithm  7; 

♦  d:  a  constant  such  that  d  <  min^  A  j  and  0  <  d  <  1  in  Algorithm  7; 

♦  et:  probability  of  exploration  at  turn  t  (used  in  Algorithm  2  and  Algorithm  3); 

♦  upper  bound  on  the  probability  of  considering  suboptimal  arm  j  being  the  best  arm  at  round  t  when  using 
Algorithm  2; 

♦  upper  bound  on  the  probability  of  considering  suboptimal  arm  j  being  the  best  arm  at  round  t  when 
using  Algorithm  7; 

♦  upper  bound  on  the  probability  of  considering  suboptimal  arm  j  being  the  best  arm  at  round  l  when  using 
Algorithm  3; 

♦ $\psi(t)$: smoothing function used to define the probabilities of exploration $\varepsilon_t$ in Algorithm 3 (see Figure 2);

♦ $\gamma$: lowest value of $\psi(t)$ ($\gamma = \min_{t\in\{m+1,\dots,n\}}\psi(t)$);

♦  n particular  time  defined  as  km  in  the  comparison  between  Algorithm  3  and  Algorithm  7  in  Section  3.4; 

♦  w :  first  round  when  is  less  than  7  (w  =  argmin  /(s),  subject  to  /(«)  <  7,  where  f(s)  —  )  in  the 

comparison  between  Algorithm  3  and  Algorithm  7  in  Section  3.4; 

♦  B  set  of  rounds  when  the  “high  reward"  zone  is  entered  in  Algorithm  4  (B  =  {t :  G(t  —  1)  <  zy  G(t)  >  z})\ 

♦  Vjt  =  {t  :  t  >  ykyG(t)  >  zyt  <  j/*+i}:  set  of  rounds  in  the  high-reward  period  entered  at  time  yk  (k  e 
{1,  •  ,  | B\})  in  Algorithm  4; 

♦  A  k  ~  max  ten  G(0:  highest  value  of  G(t)  on  Yk  in  Algorithm  4; 

♦  Rk  —  Ajk  |  Yk\ :  the  maximum  regret  of  the  kxh  high  reward  zone  in  Algorithm  4; 

♦ $\xi(t)$: smoothing function used to define the decision rule in Algorithm 5;

♦ $e$: Euler's number;

♦  Rn:  total  regret  at  round  n. 
Multi-Armed Bandit Problem


Kamesh  Munagala 
August  11,  2017 


Abstract 

In this work, we mainly study the budgeted version of the sequential decision-making problem, in which each decision generates a reward and incurs a cost. In particular, we consider the multi-armed bandit problem, which has limited feedback. There are two extreme cases for this problem: stochastic and adversarial. First, we consider the stochastic version of the problem, in which there is a joint distribution for the cost and reward of each arm at each time step, and we compare different algorithms in simulation. Then, we provide an impossibility result for the fully adversarial setting, in which the reward and cost of each arm at each time step are determined by an adversary. We offer a new input model, named leaky bucket, to model a weaker adversary, and propose a simple policy for a special case of this new model. Next, we consider the Markovian input model and propose a new policy named Recency, which we compare experimentally with an existing policy designed for the stochastic version of the problem. Finally, we consider the contextual bandit with linear payoffs (there is no cost associated with a decision) and try to generalize an existing algorithm from the case in which there is a single unknown parameter for the problem to the case in which there is one unknown parameter for each arm.


1  Introduction 

Multi-armed bandit (MAB) problems are used to model the exploration-exploitation trade-offs that are inherent in many sequential decision-making problems. Different versions of this problem have been studied, and different algorithms have been devised to solve them. These algorithms have a wide range of applications, including medical trials, online advertising, recommender systems, and scheduling.

The basic idea is that we are given a set of options (or arms), and each arm has an unknown reward associated with it. At each time step, we have to decide which arm to play so as to maximize our total reward. In MAB problems the feedback is limited: after playing an arm in a time step, the player sees only the reward of that arm in that step and obtains no information about the other options. The full-feedback version of the problem, called the expert problem, has also been studied, and some MAB algorithms (e.g. EXP3) leverage existing algorithms for the expert problem to solve the MAB problem.

This problem can be adapted to several different settings. We mainly consider the MAB problem in the presence of costs: each arm incurs a cost, as well as producing a reward, at each time step (note that the cost and reward at each time step may be correlated). There is a single budget, and we can continue to choose an arm at each step until the total cost exceeds the budget. We define our benchmark as the single best arm (playing that arm at each time step). The regret of an algorithm is defined as the difference between the total reward of our benchmark and the total reward of the algorithm. The objective is to design an algorithm that minimizes regret. This problem (and a more general version of it) has already been studied in the stochastic environment, in which there is a joint distribution for the reward and cost of each arm, and the reward and cost for that arm are an independent sample from its unknown distribution at each time step. In the first part of this work, we also consider this case and compare different algorithms by simulation.

Then, we investigate whether similar results can be obtained for a non-stochastic environment. In particular, we consider the adversarial model of the problem, in which the reward and cost of each arm at each time step are chosen by an adversary. For this setting, we prove some negative results showing that the adversary is too powerful to compete against. We then define a weaker adversarial model, named the leaky bucket input model, and propose a simple algorithm for this new version of the problem. Next, we consider a Markovian input model, in which for each arm there exists an underlying Markov chain with few states. We propose a new algorithm named Recency to adapt to this changing environment. We implement this algorithm and compare its results on some sample inputs with those of UCB-Simplex, which was originally designed for the stochastic input model in [9].

Finally, we consider the contextual bandit problem with linear payoffs. In this version of the problem, we assume that the number of rounds in which we need to pull an arm is given and each arm only produces a reward at each time step. In this section, we try to generalize the Thompson Sampling algorithm in [2] to another version of the problem.


2  Problem  Statement 


There are $m$ different arms, and each arm $i$ produces a reward $r_i^t \in [0,1]$ and incurs a cost $c_i^t \in [0,1]$ at time step $t$. There are different models for the reward and cost of the arms. The first is the stochastic model, which has received a lot of attention: there exists a fixed unknown joint distribution for the reward and cost of each arm, and choosing an arm at a time step yields an independent sample from its distribution that determines the reward and cost of that choice. We mention some of the existing results for this model in the next section. The other is the adversarial model, in which the reward and cost of each arm at each time step are determined by an adversary. In this case, the history of the rewards and costs of an arm in the previous rounds tells us nothing about these values in the current time step.

The budget, denoted by $B$, determines the stopping time of the game. The player continues to choose an arm at each time step until the total cost exceeds the budget. Let $p(t)$ denote the arm chosen at time $t$ by policy $p$; the stopping time of policy $p$, denoted by $s(p)$, should satisfy the following:

$$\sum_{t=1}^{s(p)} c^t_{p(t)} \le B \quad\text{and}\quad \sum_{t=1}^{s(p)+1} c^t_{p(t)} > B.$$
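The stopping-time condition can be made concrete with a short sketch. It assumes the sequence of costs incurred by the policy is available as a list; the function name `stopping_time` is hypothetical and not taken from the original work.

```python
def stopping_time(costs, budget):
    """Return s(p): the largest number of pulls whose total cost stays <= budget.

    costs[t] is the cost incurred by the (t + 1)-th pull of the policy. If the
    budget is never exceeded within the given sequence, the length of the
    sequence is returned (s(p) is then not determined by the sample alone).
    """
    spent = 0.0
    for pulls_so_far, cost in enumerate(costs):
        if spent + cost > budget:
            return pulls_so_far
        spent += cost
    return len(costs)


# Two pulls of cost 0.5 fit a budget of 1.0; the third would exceed it.
assert stopping_time([0.5, 0.5, 0.5], 1.0) == 2
```

A pull that lands exactly on the budget is still allowed, matching the non-strict inequality in the first condition.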


The regret of an algorithm is defined as the difference between the total reward of a benchmark and the total reward of the algorithm. In this work, we consider the best single arm policy as the benchmark. Let $p_i$ denote the single arm policy that always plays arm $i$; then for any policy $p$ the regret is defined as follows:

$$R(p) = \max_i \sum_{t=1}^{s(p_i)} r_i^t - \sum_{t=1}^{s(p)} r^t_{p(t)}.$$
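As a hedged illustration of regret against the best single arm, the sketch below assumes that the full reward and cost sequences of every arm are known after the fact (which is only possible in simulation). The names `reward_until_budget` and `regret_vs_best_single_arm` are hypothetical.

```python
def reward_until_budget(rewards, costs, budget):
    """Total reward collected before the cumulative cost would exceed the budget."""
    spent = 0.0
    total = 0.0
    for reward, cost in zip(rewards, costs):
        if spent + cost > budget:
            break
        spent += cost
        total += reward
    return total


def regret_vs_best_single_arm(arm_rewards, arm_costs, choices, budget):
    """Benchmark reward (best single arm) minus the reward of the policy.

    arm_rewards[i][t] and arm_costs[i][t] are the reward and cost of arm i at
    step t; choices[t] is the arm the policy plays at step t.
    """
    policy_rewards = [arm_rewards[arm][t] for t, arm in enumerate(choices)]
    policy_costs = [arm_costs[arm][t] for t, arm in enumerate(choices)]
    policy_total = reward_until_budget(policy_rewards, policy_costs, budget)
    best_single = max(
        reward_until_budget(r, c, budget) for r, c in zip(arm_rewards, arm_costs)
    )
    return best_single - policy_total


# Arm 0 always pays 1, arm 1 always pays 0; both cost 0.5 per pull.
rewards = [[1, 1, 1, 1], [0, 0, 0, 0]]
costs = [[0.5] * 4, [0.5] * 4]
assert regret_vs_best_single_arm(rewards, costs, [1, 0, 0, 0], 1.0) == 1.0
```

In the usage example the policy wastes its first pull on the poor arm, so it collects 1 while the benchmark collects 2.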


3  Related  Work 

Ding et al. [7] were the first to study the multi-armed bandit problem with a budget constraint and variable costs. In their setting, there exists one distribution for the reward of each arm. In addition, there exists a multinomial distribution for the cost of each arm. They assume that the cost is between 0 and 1 and is always a multiple of the unit cost. The reward and cost of arm $i$ at each time step are i.i.d. samples from the corresponding distributions. They propose two algorithms for this problem with a regret bound of $O(\ln B)$.

Badanidiyuru et al. [5] tackle the more general bandits with knapsacks (BWK) problem. In this setting, there are $d$ different resources, each one with its own budget. The process of choosing an arm continues until at least one of the resources has been fully consumed. They look in particular at a stochastic version of the problem, where the reward and costs for each arm are an independent sample from an unknown joint distribution. In this general setting, the best single arm may not be a strong benchmark. For example, if there exist 2 different resources and 2 arms, where arm 1 only consumes the first resource and arm 2 only consumes the second resource, an optimal policy should play both arms. They therefore consider a stronger benchmark: the optimal dynamic policy when the joint distribution is known upfront.

They propose two algorithms with regret $\tilde O\big(\sqrt{m\,\mathrm{OPT}} + \mathrm{OPT}\sqrt{m/B}\big)$. Here, $m$ is the number of arms, $B$ is the minimum of all the budgets, and OPT is the expected reward of the optimal policy (the $\tilde O$ hides logarithmic factors).

Flajolet et al. [9] propose the UCB-Simplex method (described in the next section), which has logarithmic regret. They analyze this algorithm for three different cases in terms of the number of resources: a single resource, two resources where one of them is time (the time consumption for each arm is deterministic and equal), and an arbitrary number of resources (in the third case, the costs for each arm are deterministic). The single resource case is most relevant to our problem, and it has regret $O(\ln B)$.

All of the aforementioned results are based on the assumption that the reward and cost(s) for each arm are stochastic and the environment remains unchanged. However, we believe that this assumption does not accurately model practical use cases. We are investigating this problem for different non-stochastic settings.

The first non-stochastic model is the classic adversarial model defined in [4]. In this model an adversary produces the reward and cost at each round. Auer et al. in [4] consider the bandit problem with a single reward for each arm at each time step. In their setting the only resource is time (the time horizon $T$ is the budget) and each arm deterministically consumes one unit of the resource at each time step. They use the Hedge algorithm, which guarantees that the regret in the expert problem (full feedback) is at most $O(\sqrt{T\ln m})$, and design the EXP3 algorithm, whose regret is $O(\sqrt{Tm\log m})$, where $m$ is the number of arms. There is no result when the resource consumption of each arm is also chosen by an adversary. In the next section, we will investigate whether in this version of the problem it is possible to compete against the best single arm.

Slivkins and Upfal in [10] consider a changing environment in which the expected reward of each arm can change according to a Brownian motion. Their problem is a special case of the restless bandit problem in which the expected reward of each arm is the state of that arm. They define $\mu_i(t) \in [0,1]$ (the expected reward of arm $i$) as the state of arm $i$ at time $t$. Then, at time $t+1$, $X_i(t)$, which is a sample from $\mathcal{N}(0, \sigma_i^2)$, is added to $\mu_i(t)$ and determines the value of $\mu_i(t+1)$ (the value should be in $[0,1]$, therefore the interval has reflecting boundaries). The volatility of arm $i$, denoted by $\sigma_i$, is known to the algorithm. They propose algorithms for the state-informed version (in which the feedback contains not only the reward but also the state of the arm) and the state-oblivious version of this problem. However, their work does not consider the problem in the presence of cost.
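The state update with reflecting boundaries can be sketched as follows. This is an illustrative reading of the description above, not code from [10]; the helper names `reflect_into_unit_interval` and `step_state` are hypothetical, and folding by repeated reflection is an assumption about how the boundary is handled when an increment overshoots.

```python
import random


def reflect_into_unit_interval(x):
    """Fold x into [0, 1] by reflecting at the boundaries 0 and 1."""
    x = x % 2.0  # reflections at 0 and 1 repeat with period 2
    return 2.0 - x if x > 1.0 else x


def step_state(mu, sigma, rng):
    """One step of the expected reward: add a N(0, sigma^2) sample, then reflect."""
    return reflect_into_unit_interval(mu + rng.gauss(0.0, sigma))


# Simulate one arm with volatility 0.1 for 1000 steps.
rng = random.Random(0)
mu = 0.5
for _ in range(1000):
    mu = step_state(mu, 0.1, rng)
    assert 0.0 <= mu <= 1.0
```

For small volatility a single reflection almost always suffices; the modulo form simply keeps the helper correct for arbitrarily large increments.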

Another line of work focuses on Thompson sampling, which is a Bayesian algorithm. Agrawal and Goyal in [1] prove that the expected regret of Thompson sampling for the stochastic multi-armed bandit problem (without cost) is logarithmic in the time horizon. They assume Bayesian priors for the mean reward of each arm by using Beta distributions (their analysis is prior-free). A very natural extension of this work to the budgeted version of this problem is to have two different priors for each arm (one for cost and one for reward). In this case, we can sample both the reward and the cost of each arm from the corresponding Beta distributions and play the arm with the highest ratio of reward to cost. Xia et al. in [12] propose this extension and prove that the distribution-dependent regret bound of this algorithm is $O(\ln B)$. Agrawal and Goyal in [2] generalized the Thompson sampling algorithm to the stochastic contextual bandit problem with linear payoffs.

4  Stochastic  Budgeted  MAB 

In this section, we generalize the Thompson Sampling algorithm presented in [1] for the stochastic MAB problem to the budgeted version of the problem in the presence of cost. Then, we run some experiments and show that this algorithm works better than other existing non-Bayesian (UCB-based) methods for this problem. Xia et al. in [12] independently proposed the same algorithm and showed similar results in their work.

First, we explain Thompson Sampling for the stochastic MAB as introduced in [1]. There are $n$ different arms and there is a time horizon $T$. There exists an unknown distribution for each arm, and the reward of that arm in each time step is an i.i.d. sample from its corresponding distribution. In this algorithm, for each arm $i$ there are two attributes, $S_i$ and $F_i$ (initialized to zero), and the prior for the mean reward of arm $i$ is $\mathrm{Beta}(S_i + 1, F_i + 1)$. At each time step the algorithm samples one value from the prior of each arm and plays the arm $i$ with the highest sampled value. Then it observes the reward $r$ of that arm as feedback, and updates the prior of that arm as follows: with probability $r$ the value of $S_i$ is increased by one, and otherwise (with probability $1 - r$) the value of $F_i$ is increased by one.
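The procedure just described can be sketched in a few lines. This is a minimal simulation sketch, not the implementation used in the experiments; the function name `thompson_sampling`, the `reward_fns` interface (one callable per arm returning a reward in [0, 1]) and the fixed seed are assumptions made for illustration.

```python
import random


def thompson_sampling(reward_fns, horizon, seed=0):
    """Thompson Sampling with Beta priors and the randomized (Bernoulli) update.

    reward_fns[i](rng) must return a reward in [0, 1] for arm i (hypothetical
    interface). Returns the pull counts and the S_i / F_i counters.
    """
    rng = random.Random(seed)
    n_arms = len(reward_fns)
    successes = [0] * n_arms  # S_i
    failures = [0] * n_arms   # F_i
    pulls = [0] * n_arms
    for _ in range(horizon):
        # Sample one value per arm from Beta(S_i + 1, F_i + 1).
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = reward_fns[arm](rng)
        pulls[arm] += 1
        # With probability r count a success, otherwise a failure.
        if rng.random() < reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls, successes, failures


pulls, s, f = thompson_sampling([lambda rng: 0.9, lambda rng: 0.1], horizon=500)
assert sum(pulls) == 500
```

The randomized update is what lets rewards in [0, 1], rather than only binary rewards, be folded into Beta counters.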

The natural generalization of the above algorithm to the budgeted version of the problem, with two unknown distributions for each arm, is to keep two Beta distributions for each arm as priors: one for reward and one for cost. Therefore, each arm $i$ has four attributes: $S_i^c$, $F_i^c$, $S_i^r$, and $F_i^r$. Then, at each time step we can take two samples for each arm $i$: one sample from $\mathrm{Beta}(S_i^r + 1, F_i^r + 1)$ for the reward of arm $i$ and one sample from $\mathrm{Beta}(S_i^c + 1, F_i^c + 1)$ for the cost of arm $i$. Then we play the arm with the highest ratio of sampled reward to sampled cost. The update step is exactly the same for each distribution. In other words, if after playing arm $i$ at the last time step the observed values for reward and cost are $r$ and $c$, then with probability $r$ ($c$) we increase $S_i^r$ ($S_i^c$) by one and otherwise we increase $F_i^r$ ($F_i^c$) by one.
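A minimal sketch of this budgeted variant follows. It is not the code used for the reported experiments: the function name `budgeted_thompson_sampling`, the `draw(arm, rng)` interface returning a `(reward, cost)` pair in [0, 1], the `max_rounds` safety cap, and the rule of stopping before the pull that would exceed the budget are all assumptions made for illustration.

```python
import random


def budgeted_thompson_sampling(draw, n_arms, budget, seed=0, max_rounds=100000):
    """Budgeted Thompson Sampling with separate Beta priors for reward and cost."""
    rng = random.Random(seed)
    s_r = [0] * n_arms  # reward successes
    f_r = [0] * n_arms  # reward failures
    s_c = [0] * n_arms  # cost successes
    f_c = [0] * n_arms  # cost failures
    pulls = [0] * n_arms
    spent = 0.0
    total_reward = 0.0
    for _ in range(max_rounds):
        ratios = []
        for i in range(n_arms):
            r_hat = rng.betavariate(s_r[i] + 1, f_r[i] + 1)
            c_hat = rng.betavariate(s_c[i] + 1, f_c[i] + 1)
            ratios.append(r_hat / max(c_hat, 1e-12))  # guard against a zero sample
        arm = max(range(n_arms), key=lambda i: ratios[i])
        reward, cost = draw(arm, rng)
        if spent + cost > budget:
            break  # this pull would exceed the budget
        spent += cost
        total_reward += reward
        pulls[arm] += 1
        # Randomized updates, applied independently to reward and cost counters.
        if rng.random() < reward:
            s_r[arm] += 1
        else:
            f_r[arm] += 1
        if rng.random() < cost:
            s_c[arm] += 1
        else:
            f_c[arm] += 1
    return total_reward, spent, pulls


def demo_draw(arm, rng):
    """Deterministic toy arms: arm 0 pays 0.8, arm 1 pays 0.2, both cost 0.5."""
    return (0.8, 0.5) if arm == 0 else (0.2, 0.5)


total, spent, pulls = budgeted_thompson_sampling(demo_draw, n_arms=2, budget=100.0)
assert spent <= 100.0 and sum(pulls) == 200
```

With equal costs the ratio rule reduces to preferring the arm with the higher sampled reward, so the toy run concentrates its pulls on arm 0.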

4.1  Experiments 

Firstly,  we  describe  the  other  non-Bayesian  algorithms  used  in  our  simulation.  Then,  we  provide 
the  results. 

4.1.1  UCB1 

We do not use this algorithm in our simulation, but it is the basic idea behind the other algorithms, so we briefly explain it here. This algorithm was designed in [3] for the stochastic case in which each arm only produces a reward (no cost) at each time step. It is a simple index-based approach: after an initialization step (playing each arm once), at each round UCB1 chooses the arm $i$ maximizing $\bar x_i + \sqrt{2\ln n / n_i}$, where $\bar x_i$ is the empirical average reward of arm $i$, $n$ is the total number of times that the arms have been played up to this point, and $n_i$ is the number of times that arm $i$ has been played so far. Note that the first term $\bar x_i$ is the exploitation term, encouraging playing arms with high estimated reward so far, and the second term $\sqrt{2\ln n / n_i}$ is the exploration term, encouraging playing arms which have been played less often so far.

The idea used in this algorithm is called optimism in the face of uncertainty. For each arm $i$, $\bar x_i + \sqrt{2\ln n / n_i}$ is the highest value (optimistic estimate) of a high-probability confidence interval for the expected reward of that arm. This value is obtained from Hoeffding's inequality. Note that changing the probability of the confidence interval results in other UCB algorithms (e.g. UCB2).
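The UCB1 index rule can be sketched as follows. This is an illustrative sketch of the standard rule (empirical mean plus $\sqrt{2\ln n / n_i}$), not code from the experiments; the function name `ucb1` and the `pull(arm)` callback are hypothetical.

```python
import math


def ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` plays; pull(arm) must return a reward in [0, 1]."""
    counts = [0] * n_arms   # n_i: number of plays of arm i
    means = [0.0] * n_arms  # empirical average reward of arm i

    def play(arm):
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average

    # Initialization: play each arm once.
    for arm in range(n_arms):
        play(arm)
    # Afterwards, play the arm with the largest optimistic index.
    for n in range(n_arms, horizon):  # n = total number of plays so far
        indices = [
            means[i] + math.sqrt(2.0 * math.log(n) / counts[i])
            for i in range(n_arms)
        ]
        play(max(range(n_arms), key=lambda i: indices[i]))
    return counts, means


# Deterministic toy arms: arm 0 always pays 0.9, arm 1 always pays 0.1.
counts, means = ucb1(lambda arm: 0.9 if arm == 0 else 0.1, n_arms=2, horizon=200)
assert sum(counts) == 200 and counts[0] > counts[1]
```

The exploration term shrinks as an arm is played more often, so the poor arm is revisited only occasionally, at a rate that grows logarithmically with the number of plays.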

4.1.2  UCB-Simplex 

This algorithm was introduced by Flajolet et al. in [9]. The method consists of two main steps. In the first step, the algorithm finds an optimal basis of arms by solving an LP using the simplex method. In the second step, the algorithm must decide which of the arms in the optimal basis to play.

The LP is as follows:

$$
\begin{aligned}
\max\quad & \sum_{k=1}^{m}\big(\bar r_{k,t} + \beta\,\epsilon_{k,t}\big)\,\xi_k \\
\text{subject to}\quad & \sum_{k=1}^{m} \bar c_{k,t}(i)\,\xi_k \le B(i) \quad \text{for each resource } i \\
& \xi_k \ge 0 \quad \text{for each arm } k
\end{aligned}
$$

Here, $\bar r_{k,t}$ and $\bar c_{k,t}(i)$ are the empirical estimates of the reward, and of the cost of resource $i$, for arm $k$ at time step $t$. The parameter $\beta$ determines how much exploration there is, and $n_{k,t}$ denotes the number of times that arm $k$ has been played until time step $t$. Lastly, $\epsilon_{k,t} = \sqrt{2\ln t / n_{k,t}}$, which is obtained from Hoeffding's inequality when the probability is $1 - t^{-4}$. Note that this is the same as the exploration term in UCB1.

Solving this LP is the general form of the technique. In our problem, since we are dealing with only one resource, there is only one constraint, so there is no need to solve an LP. Similarly, for the second step there will be only one arm in the optimal basis. Since there is one resource, we omit the index i.

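Algorithm Description:

• Set $\beta = 1 + \frac{1}{\lambda}$, where $\lambda$ is a lower bound on the average cost of any arm.

• Exploration phase: pull each arm in a round-robin fashion until the empirical cost of each arm is non-zero.

• Exploitation phase: pull the arm with the largest UCB index, i.e. the empirical reward plus $\beta$ times the exploration term, divided by the empirical cost: $\frac{\bar{r}(k,t) + \beta \cdot \epsilon(k,t)}{\bar{c}(k,t)}$.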
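The single-resource procedure can be sketched as follows. This is an illustrative reading of the steps above, not the reference implementation of [9]: the function name, the `pull` callback returning a (reward, cost) pair, and the stopping rule (stop as soon as the next pull would exceed the budget) are assumptions. The sketch also assumes that every arm eventually incurs a non-zero cost, otherwise the exploration phase would not terminate.

```python
import math

def ucb_simplex_single_resource(pull, num_arms, budget, cost_lower_bound):
    """Budgeted bandit with one resource; `pull(k)` returns (reward, cost) in [0, 1]."""
    beta = 1.0 + 1.0 / cost_lower_bound           # beta = 1 + 1/lambda
    counts = [0] * num_arms
    reward_sum = [0.0] * num_arms
    cost_sum = [0.0] * num_arms

    def index(k, t):
        eps = math.sqrt(2.0 * math.log(t) / counts[k])      # exploration term
        mean_reward = reward_sum[k] / counts[k]
        mean_cost = cost_sum[k] / counts[k]
        return (mean_reward + beta * eps) / mean_cost

    spent, t = 0.0, 0
    while True:
        t += 1
        if any(c == 0.0 for c in cost_sum):
            arm = (t - 1) % num_arms              # exploration phase: round robin
        else:
            arm = max(range(num_arms), key=lambda k: index(k, t))
        reward, cost = pull(arm)
        if spent + cost > budget:                 # assumed stopping rule
            break
        spent += cost
        counts[arm] += 1
        reward_sum[arm] += reward
        cost_sum[arm] += cost
    return counts, sum(reward_sum), spent

# Hypothetical usage with two deterministic arms given as (reward, cost) pairs.
arms = [(0.9, 0.5), (0.2, 0.5)]
counts, total_reward, spent = ucb_simplex_single_resource(lambda k: arms[k], 2, 200.0, 0.5)
```

Because $\beta$ grows as the lower bound $\lambda$ shrinks, a small `cost_lower_bound` makes the exploration term dominate the index for a long time.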
4.1.3  UCB-BV2

This algorithm was introduced by Ding et al. in [7]. After an initialization step (pulling each arm once), the index of arm i at time t is defined by the following formula:

$$D_{i,t} = \frac{\bar{r}_{i,t}}{\bar{c}_{i,t}} + \frac{\left(1 + \frac{1}{\lambda_t}\right) \sqrt{\frac{\ln(t-1)}{n_{i,t}}}}{\lambda_t - \sqrt{\frac{\ln(t-1)}{n_{i,t}}}},$$

where $\bar{r}_{i,t}$ and $\bar{c}_{i,t}$ are the empirical average reward and cost of arm i, $n_{i,t}$ is the number of times that arm i has been played so far, and $\lambda_t$ is the minimum empirical cost among the arms at time t. The authors make the algorithm independent of any prior knowledge about the distributions by using this value instead of a lower bound on the mean cost of the arms (as in UCB-Simplex). At each step, the algorithm pulls the arm with the highest index until it exhausts the budget.
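As a small illustration, the index can be computed as below. The function name and argument layout are assumptions for this sketch, and so is the handling of a non-positive denominator (the index is set to infinity so that the arm is explored), which is not specified in the description above.

```python
import math

def ucb_bv2_indices(reward_means, cost_means, counts, t):
    """Return the UCB-BV2 index of every arm at time t (t >= 2)."""
    lam_t = min(cost_means)                       # lambda_t: minimum empirical cost
    indices = []
    for r, c, n in zip(reward_means, cost_means, counts):
        width = math.sqrt(math.log(t - 1) / n)    # sqrt(ln(t - 1) / n_{i,t})
        if lam_t <= width:
            # Assumption: the bonus is undefined here, so force exploration.
            indices.append(math.inf)
        else:
            bonus = (1.0 + 1.0 / lam_t) * width / (lam_t - width)
            indices.append(r / c + bonus)
    return indices

# Hypothetical usage: two arms with identical averages but different play counts.
idx = ucb_bv2_indices([0.5, 0.5], [0.5, 0.5], [100, 400], 1001)
```

In this example the arm that has been played less often receives the larger index, since its exploration bonus is larger.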


4.1.4  BWK 

In this algorithm, we use the upper confidence bound for the reward and the lower confidence bound for the cost of each arm, and pull the arm with the highest ratio of the two. This is the natural extension of the optimism in the face of uncertainty that we have seen in UCB1 to the budgeted version of the problem. We use the name BWK because the use of lower and upper confidence bounds is similar to [5]. However, that work considers a more general setting and its algorithm is more complicated.
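A possible form of this ratio index is sketched below. The confidence radius (the same Hoeffding-style term as in UCB1), the clipping of the reward bound at 1, and the small floor `min_cost` on the cost bound are all assumptions made for illustration; they are not taken from [5].

```python
import math

def bwk_ratio_index(reward_mean, cost_mean, plays, total_plays, min_cost=1e-6):
    """Optimistic reward-to-cost ratio of one arm."""
    radius = math.sqrt(2.0 * math.log(total_plays) / plays)   # assumed radius
    reward_ucb = min(1.0, reward_mean + radius)               # upper bound on reward
    cost_lcb = max(min_cost, cost_mean - radius)              # lower bound on cost
    return reward_ucb / cost_lcb

# Hypothetical usage: same averages, different numbers of plays.
rarely_played = bwk_ratio_index(0.5, 0.5, 10, 1000)
often_played = bwk_ratio_index(0.5, 0.5, 500, 1000)
```

The rarely played arm gets a much larger index because both bounds are loose, which is how the ratio encourages exploration.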

4.1.5  Results 

•  Simulation 1: This simulation is similar to the simulations of [7]. There are 10 different arms. The reward of each arm is drawn from a Bernoulli distribution, and the parameter $p_i$ (probability of success) for each arm i is set randomly at the beginning of the simulation. The cost of each arm is drawn from a multinomial distribution over the set {0, 1/100, 2/100, ..., 1}. The parameters of the multinomial distribution are also set randomly for each arm. We run this simulation 50 times and plot the average regret for each of the algorithms:

(Plot legend: UCB-Simplex, Thompson, BWK, UCB-BV2)


Figure  1:  Simulation  1 


•  Simulation 2: In this simulation we consider 3 arms. For the first arm, the reward at each time step is 0.6 with probability 0.5 and 0.3 otherwise. For the second arm, the reward at each time step is 0.8 with probability 0.7 and 0.1 otherwise. For these two arms, the cost is always 1 - reward. For the third arm, the cost and reward are independent: the reward is 0.7 with probability 0.8 and 0.3 otherwise, and the cost is 0.2 with probability 0.9 and 0.1 otherwise.

We run the simulation 50 times; the average regret of the different algorithms for different budget values is as follows:

(Plot legend: UCB-Simplex, Thompson, BWK, UCB-BV2)

Figure  2:  Simulation  2 


As we can see in both plots, the Thompson sampling algorithm has a lower regret than the UCB-based algorithms.

5  Adversarial  Budgeted  MAB 

In this section, we study the adversarial model in the presence of costs. As mentioned earlier, the current algorithms for the adversarial case only work with a time horizon, which is a very special case of a cost. This section contains two parts. First, we consider the fully adversarial model, which is the conventional adversarial model in which the reward and cost of each arm at each time step are determined by an adversary. Then, we introduce a new, restricted adversarial model called the leaky bucket input model.

5.1  Fully  Adversarial  Model 

In this problem, there are m different arms and a single budget B; playing an arm produces a reward and incurs a cost. At each time step, we choose one arm, until the sum of the costs of the chosen arms over the time steps exceeds the budget. The reward and cost at each step can be any value in [0, 1] (determined by an adversary). We assume that the feedback at the end of each time step reveals the performance of all of the arms at that time step (the expert problem), and that the adversary is oblivious. We show some negative results for this full-feedback version, which is easier than the MAB problem.

Lemma 1. No algorithm can obtain more than $\frac{1}{m}$ OPT, where OPT is the reward of the optimal single-arm policy.

Proof. Consider the following simple example: in the first ⌊B⌋ time steps, for all of the arms the cost is 1 and the reward is 0. Then, in time step ⌊B⌋ + 1, exactly one of the arms has reward 1 and cost 0, while for the other arms the reward and cost are still 0 and 1, respectively. Even if the algorithm knows the structure of the input, it cannot guarantee more than $\frac{1}{m}$ of OPT, because it has to choose one of the arms at random.

□ 

From  the  example  in  the  proof  of  the  previous  lemma,  it  seems  helpful  to  relax  the  budget 
constraint.  In  other  words,  the  algorithm  can  spend  a  budget  which  is  more  than  the  actual  budget 
(a  function  of  the  actual  budget)  and  it  should  compete  against  the  best  single  arm  that  consumes 
the  actual  budget. 

If the algorithm is allowed to spend nB, where n is the number of experts, it can clearly compete against the best single arm with budget B: we can run the weighted majority algorithm and ignore each expert once it has consumed budget B in the previous rounds.

Now,  we  want  to  consider  the  case  where  a  constant  c  is  given  and  the  relaxed  budget  is  cB. 
The  algorithm  should  compete  against  the  best  single  arm  with  budget  B.  In  the  next  lemma  we 
show  that  it  is  not  possible  to  have  an  algorithm  that  spends  cB  budget  and  guarantees  a  reward 
which  is  a  constant  ratio  of  the  reward  of  the  best  single  arm. 

Lemma 2. Let constants $c_1$ and $c_2$ be given, and let OPT be the reward of the best single arm with budget B. It is not possible to design a policy which guarantees $\frac{1}{c_1}$ OPT reward with $c_2 B$ budget.

Proof. We prove it by contradiction. Assume that $c_1$ is an integer (otherwise consider $\lceil c_1 \rceil$). If there exists such an algorithm, it must work on every input. Consider the following n different inputs. The first expert only has non-zero reward and cost for the first 2B steps, such that in B of them the reward is 1 and the cost is 0, and in the other B steps the reward is 0 and the cost is 1. In the ith input, the last n - i experts always have cost 0 and reward 0. For every $2 \le j \le i$, the jth expert produces non-zero reward and cost only at the time steps between $S_j = (2c_1 + 1)^{j-2} B + (j - 1)B + 1$ and $E_j = (2c_1 + 1)^{j-1} B + jB$, such that in B of them it has reward 0 and cost 1, and in the other $(2c_1 + 1)^{j-2} \times 2c_1 B$ steps it has reward 1 and cost 0.

The algorithm should produce a desirable outcome for all of the n different inputs. We can see that it should choose expert i in at least a constant fraction (depending only on $c_1$) of the time steps in which this expert has non-zero reward or cost. The reason is that the algorithm should work for the ith input, and for each j > i, the ith input and the jth input are the same before time step $E_i$. Therefore, for the nth input the expected cost will be proportional to nB, which is not a constant factor of the budget. □

Note  that  in  our  proof  the  adversary  only  chooses  one  of  the  two  possibilities  for  each  arm  at 
each  time  step:  reward  1  with  cost  0  and  vice  versa.  It  is  worthwhile  to  mention  that  the  structure 
of  the  proof  is  similar  to  [8]. 

5.2  Leaky  Bucket  Input  Model 

As we showed earlier, the fully adversarial model is too strong. In this part, we introduce a new input model called the leaky bucket input model, which is a weaker version of the adversarial model.

Input  Model: 

•  Each arm i has four unknown values: $(r_i, a_i)$ and $(c_i, b_i)$.

•  Assume $r_i, c_i \in (0, 1]$ for each arm i.

•  For each arm i, until time t (any window of time starting from time 0), the total reward is in $[r_i t - a_i, r_i t]$, and the total cost is in $[c_i t, c_i t + b_i]$.

Here, it is possible for $a_i$ and $b_i$ to be functions of time. However, we consider the simple case where these are fixed. Note that this case is similar to the stochastic version in which the expected values of the reward and cost of arm i are $r_i$ and $c_i$, respectively. However, playing an arm at each step does not give an independent sample from a joint distribution with those expected values. An adversary can choose the values, but we restrict the power of the adversary by the parameters $a_i$ and $b_i$. In other words, the adversary cannot deviate much from the actual values of the arms.
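To make the constraint concrete, the following sketch checks whether a given reward and cost sequence of a single arm is admissible under the leaky bucket model with fixed $a_i$ and $b_i$. The function name and the numerical tolerance `tol` are assumptions added for the example.

```python
def satisfies_leaky_bucket(rewards, costs, r, a, c, b, tol=1e-9):
    """Check the prefix constraints of one arm with parameters (r, a) and (c, b)."""
    total_reward = 0.0
    total_cost = 0.0
    for t, (reward, cost) in enumerate(zip(rewards, costs), start=1):
        total_reward += reward
        total_cost += cost
        # Total reward up to time t must lie in [r*t - a, r*t].
        if not (r * t - a - tol <= total_reward <= r * t + tol):
            return False
        # Total cost up to time t must lie in [c*t, c*t + b].
        if not (c * t - tol <= total_cost <= c * t + b + tol):
            return False
    return True

# Hypothetical sequences for an arm with r = c = 0.5 and a = b = 1.
ok = satisfies_leaky_bucket([0, 0, 1, 1], [0.5, 1.0, 1.0, 0.5], 0.5, 1.0, 0.5, 1.0)
too_costly = satisfies_leaky_bucket([0.5] * 4, [1, 1, 1, 1], 0.5, 1.0, 0.5, 1.0)
```

The first sequence stays inside both bands at every prefix, while the second one overshoots the cost band at t = 3, so an adversary in this model could not produce it.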

First, note that sampling the arms as in the other input models is no longer helpful, because the adversary can control the input. For example, the adversary can choose a number of small costs followed by a very large cost for an arm. In this case, we cannot use Hoeffding's inequality and other similar techniques. The only option is to play an arm consecutively for some time steps in order to estimate its values. Since we do not know the parameters of each arm, the best option is to play each of them until it consumes a certain amount of budget. After this sampling step, we play the best arm according to the sampling step for the rest of the time steps. We propose a simple policy for this model:

As before, let B be the total budget, and let m be the total number of arms. Let $\epsilon$ be a parameter.

Algorithm Description:

•  Exploration: Play an arm continuously until a fixed amount of cost (determined by $\epsilon$) is consumed for that arm. Then switch to another arm and do the same until all arms are explored.

•  Exploitation: Pick the arm with the highest estimated ratio of reward to cost, and play this arm for the remainder of the budget.

The value of $\epsilon$ is set so as to minimize the regret. A similar explore-then-exploit policy has been defined in [11] for the stochastic version of the problem. One possible future direction is to study the problem when $a_i$ and $b_i$ are functions of time (instead of constant values).
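The policy can be sketched as follows. The exact exploration amount per arm is left as the parameter `explore_cost`, since we only assume that it is some positive quantity derived from $\epsilon$; the function name, the `pull(arm, t)` callback, and the stopping rule are likewise assumptions for this illustration.

```python
def explore_then_exploit(pull, num_arms, budget, explore_cost):
    """`pull(arm, t)` returns the (reward, cost) of `arm` at time step `t`."""
    spent, total_reward, t = 0.0, 0.0, 0
    best_arm, best_ratio = None, -1.0
    for arm in range(num_arms):
        arm_reward = arm_cost = 0.0
        while arm_cost < explore_cost:            # exploration: stay on this arm
            t += 1
            reward, cost = pull(arm, t)
            if spent + cost > budget:
                return total_reward, best_arm
            spent += cost
            total_reward += reward
            arm_reward += reward
            arm_cost += cost
        ratio = arm_reward / arm_cost             # estimated reward-to-cost ratio
        if ratio > best_ratio:
            best_arm, best_ratio = arm, ratio
    while True:                                   # exploitation: commit to best arm
        t += 1
        reward, cost = pull(best_arm, t)
        if spent + cost > budget:
            return total_reward, best_arm
        spent += cost
        total_reward += reward

# Hypothetical usage with two time-invariant arms given as (reward, cost) pairs.
values = [(0.2, 0.5), (0.8, 0.5)]
total, best = explore_then_exploit(lambda arm, t: values[arm], 2, 100.0, 5.0)
```

Playing each arm in one consecutive block is what makes the estimate meaningful in this model: over a block of length t the totals can deviate from $r_i t$ and $c_i t$ by at most $a_i$ and $b_i$.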

6  Recency Policy: Detecting the Changes in the Environment

As mentioned earlier, we are interested in MAB problems in a non-stochastic setting. We considered adversarial settings in the previous section. In this section, we consider another non-stochastic setting called the Markovian input model. In this setting, each arm might have multiple states, and each state has an associated reward and cost. Transition probabilities specify how arms move from one state to another.

We propose Recency, a new algorithm for this problem. As the name suggests, this algorithm gives more importance to recent events in the cost and reward sequence. The intuition behind this is as follows: in certain settings, the characteristics of the arms may change over time, so it may be inefficient and inaccurate to consider older information. The algorithm has sample steps and update steps. In any time step which is not a sample step or an update step, the algorithm plays the arm picked in the last update step. In sample steps, it randomly samples the rewards and costs of different arms and updates their empirical estimates. In each update step, it uses those estimated values to pick the arm with the highest ratio of reward to cost. In addition, it resets (or multiplies by a discount factor) the previous estimated values for the arms. The sampling and update steps depend on the budget consumed, and occur at random points drawn from an exponential distribution. We define the following two quantities: the sampling rate s and the update rate u, which are set based on the budget B.

Algorithm  Description: 

•  Sample Time (determined by sampling rate s)

-  Randomly choose arm k.

-  Update the total reward $r_k$, total cost $c_k$, and number of plays $n_k$.

-  Set the next sample time to when the cost consumed is Exponential(s).

2 

-  (The  number  of  sampling  steps  is  roughly  St/m). 

•  Update Time (determined by update rate u)

-  Choose arm i with the largest $r_i / c_i$. Use this arm until the next update.

-  For all arms k, reset $r_k$, $c_k$, and $n_k$ to 0.

-  Set the next update time to when the cost consumed is Exponential(u).

-  (The  number  of  updating  steps  is  roughly  Bi). 

As opposed to algorithms designed for a stochastic setting, this algorithm can detect changes in the states of the arms. Additionally, since the sampling frequency depends on the budget consumed, it will not spend too much budget on sampling. However, since it periodically samples all the arms, in some cases its regret can be larger than that of algorithms that do not account for a changing environment.
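One way to read the sample and update steps is sketched below. Several details are assumptions made for this illustration: the rates are passed in as plain parameters `sample_rate` and `update_rate`, Exponential(s) is interpreted as an exponential gap with rate s measured in consumed budget, estimates are reset (not discounted) and only updated on sample steps, an arm with no samples since the last reset is never selected, and every arm changes state at every time step whether or not it is played. The environment uses the two arms of Simulation D.

```python
import random

def recency(pull, num_arms, budget, sample_rate, update_rate, rng):
    """`pull(arm)` returns (reward, cost); costs are assumed to be positive."""
    reward_sum = [0.0] * num_arms
    cost_sum = [0.0] * num_arms
    spent = total_reward = 0.0
    current = rng.randrange(num_arms)             # arm used between updates
    next_sample = rng.expovariate(sample_rate)    # budget level of next sample step
    next_update = rng.expovariate(update_rate)    # budget level of next update step
    while True:
        if spent >= next_update:                  # update step
            sampled = [k for k in range(num_arms) if cost_sum[k] > 0.0]
            if sampled:
                current = max(sampled, key=lambda k: reward_sum[k] / cost_sum[k])
            reward_sum = [0.0] * num_arms         # forget older information
            cost_sum = [0.0] * num_arms
            next_update = spent + rng.expovariate(update_rate)
        is_sample = spent >= next_sample          # sample step
        arm = rng.randrange(num_arms) if is_sample else current
        reward, cost = pull(arm)
        if spent + cost > budget:
            return total_reward, spent
        spent += cost
        total_reward += reward
        if is_sample:
            reward_sum[arm] += reward
            cost_sum[arm] += cost
            next_sample = spent + rng.expovariate(sample_rate)

def make_markov_env(arms, start_states, rng):
    """Each arm is ((state 1, state 2), switch probability); a state is (reward, cost)."""
    states = list(start_states)
    def pull(arm):
        for k, (_, switch_prob) in enumerate(arms):
            if rng.random() < switch_prob:        # assumed: all arms evolve each step
                states[k] = 1 - states[k]
        return arms[arm][0][states[arm]]
    return pull

rng = random.Random(1)
arms = [(((1.0, 0.1), (0.1, 1.0)), 0.001), (((0.1, 1.0), (1.0, 0.1)), 0.001)]
env = make_markov_env(arms, [0, 0], rng)
total_reward, spent = recency(env, 2, 200.0, 0.5, 0.05, rng)
```

Since the ratio $r_i / c_i$ only uses the totals, the number of plays $n_k$ is not tracked in this sketch.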

6.1  Experiments 

Here, we consider different scenarios and compare the regret of our algorithm with that of UCB-Simplex (described in the stochastic MAB section).

The results of the simulations comparing UCB-Simplex and Recency are given below. For each state, the first entry is the reward and the second entry is the cost. The transition probability is the probability of a change from one state to the other. In the plots, the x-axis shows the different budgets, and the y-axis represents the total regret with respect to OPT.

•  Simulation A:

In this simulation we have 2 arms.

| Arm | State 1 | State 2 | Transition Probability |
|---|---|---|---|
| 1 | (0.5, 0.5) | - | - |
| 2 | (0.1, 1) | (1, 1) | S1 -> S2: 0.0001, S2 -> S1: 0.001 |

•  Simulation B:

In this simulation we have 3 arms. The first two arms are the same as the two arms of simulation A. A third arm with small cost and reward is added to the options:

| Arm | State 1 | State 2 | Transition Probability |
|---|---|---|---|
| 1 | (0.5, 0.5) | - | - |
| 2 | (0.1, 1) | (1, 1) | S1 -> S2: 0.0001, S2 -> S1: 0.001 |
| 3 | (0.005, 0.05) | (0.05, 0.05) | S1 -> S2: 0.0001, S2 -> S1: 0.001 |

Figure 3 shows the results of simulations A and B:


Figure  3:  Simulations  A  and  B. 


Although the two scenarios are almost the same (the extra arm has the same ratio of reward to cost as the second one), the regret of UCB-Simplex is very different for them. The reason is that the known lower bound on the cost is much smaller in the second scenario, so even for large values of the budget the algorithm explores in most of the time steps. Recency periodically samples the arms, but it does not depend on the lower bound, so it behaves the same in the two scenarios.

•  Simulation C:

In this simulation we have 2 arms:

| Arm | State 1 | State 2 | Transition Probability |
|---|---|---|---|
| 1 | (1, 1) | (0.3, 0.6) | 0.0001 |
| 2 | (0.9, 1) | (1, 0.7) | 0.0001 |

•  Simulation D:

In this simulation we have 2 arms which are the same and start from different states:

| Arm | State 1 | State 2 | Transition Probability |
|---|---|---|---|
| 1 | (1, 0.1) | (0.1, 1) | 0.001 |
| 2 | (0.1, 1) | (1, 0.1) | 0.001 |

Figure 4 shows the results of simulations C and D.


Figure  4:  Simulations  C  and  D 


•  Simulation E:

In this simulation we have four arms:

| Arm | State 1 | State 2 | Transition Probability |
|---|---|---|---|
| 1 | (1.0, 0.01) | (0.01, 1.0) | 0.001 |
| 2 | (0.9, 0.1) | (0.1, 0.9) | 0.001 |
| 3 | (0.8, 0.15) | (0.15, 0.8) | 0.001 |
| 4 | (0.7, 0.2) | (0.2, 0.7) | 0.001 |

•  Simulation F:

In this simulation there are two arms:

| Arm | State 1 | State 2 | Transition Probability |
|---|---|---|---|
| 1 | (1, 0.0001) | (0.1, 1) | 0.0001 |
| 2 | (0.01, 1.0) | (1, 1) | 0.0001 |

Figure 5 shows the results of simulations E and F.

Figure 5: Simulations E and F


As we can see, in scenarios C, D, and F the performance of the two algorithms is similar. In scenario E, UCB-Simplex has a lower regret. As we mentioned earlier, since our algorithm samples all the arms periodically, it is not surprising that in some cases UCB-Simplex has a lower regret. However, it seems that our algorithm can detect the changes in some cases. In addition, it does not depend on the lower bound on the cost consumption, which can be problematic in some cases such as scenario B.

7  Contextual  MAB  with  Linear  Payoffs 

In this section, we generalize the existing Thompson sampling algorithm for a contextual bandit problem with linear payoffs, explained in [2], to another contextual bandit problem with linear payoffs. First, we explain the problem and the algorithm presented in that work. Then, we explain the other problem and our idea for using the Thompson sampling algorithm for it. Finally, we implement the suggested algorithm for the second problem and show experimentally that it has a low regret.

In [2], at each time step t, a context $b_i(t) \in \mathbb{R}^d$ is given for each arm i (by an adversary). In addition, there is an unknown parameter $\mu \in \mathbb{R}^d$ for the problem. The reward of playing arm i at time t is $b_i(t)^T \mu$ plus a noise term which is a sample from an R-sub-Gaussian distribution. The goal is to compete with the predictor who knows the parameter $\mu$ and plays the arm i maximizing $b_i(t)^T \mu$ at time step t. In the Thompson sampling algorithm proposed for this problem, at each time step they sample $\tilde{\mu}(t)$ from a multivariate Gaussian distribution $\mathcal{N}(\hat{\mu}, v^2 B^{-1})$ and play the arm with the highest $b_i(t)^T \tilde{\mu}(t)$. The initial values are as follows: $B = I_d$, $\hat{\mu} = f = 0_d$. After playing such an arm i and observing reward $r_t$, the update step is as follows: $B = B + b_i(t) b_i(t)^T$, $f = f + b_i(t) r_t$, and $\hat{\mu} = B^{-1} f$.

We wish to generalize the Thompson sampling algorithm above to the following problem. At each time step, a single context $b(t) \in \mathbb{R}^d$ is given, which is drawn from some distribution (it is no longer chosen by an adversary). Each arm i has an unknown parameter $\mu_i \in \mathbb{R}^d$. The reward of playing arm i at time t is $b(t)^T \mu_i$ plus a noise term which is a sample from an R-sub-Gaussian distribution. Note that in this problem the context specifies features of the environment at that time step (not features of the arm). This is the reason that we have only one context vector at each time step. This problem was defined in [6], which provides an interesting motivation for it: each arm can be a particular treatment, and the context might represent the characteristics of the patient at that time step. This example justifies the assumption that there is a single context at each time step which is drawn from a distribution. This problem is harder than the previous one because there is an unknown parameter for each arm, we need to estimate all of them, and at each time step we can only update one of the estimates.

In order to generalize the proposed algorithm to the new setting, we keep one multivariate Gaussian prior for each of the arms and take a sample $\tilde{\mu}_i(t)$ for each arm i at each time step t. Then we play the arm i which maximizes $b(t)^T \tilde{\mu}_i(t)$ and observe the reward r. Finally, we update the prior of arm i just as before. We implement this algorithm, and the results are shown in the experiments subsection.
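The per-arm variant can be sketched with the standard library only. This is an illustrative reading, not the implementation used for the experiments: the class and function names, the Gaussian noise in the simulated rewards, and the values of `v` and `noise_std` are assumptions. Each arm keeps its own matrix B and vector f (starting from the identity and zero), and a sample from a Gaussian with covariance $v^2 B^{-1}$ is drawn through a Cholesky factor of B.

```python
import math
import random

def cholesky(a):
    """Return lower-triangular L with a = L L^T (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def solve_lower(low, y):
    """Solve L x = y by forward substitution."""
    x = [0.0] * len(y)
    for i in range(len(y)):
        x[i] = (y[i] - sum(low[i][k] * x[k] for k in range(i))) / low[i][i]
    return x

def solve_upper(low, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

def dot(u, w):
    return sum(x * y for x, y in zip(u, w))

class ArmPosterior:
    """Gaussian posterior of one arm, stored as B and f."""
    def __init__(self, d):
        self.B = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.f = [0.0] * d

    def mean(self):
        low = cholesky(self.B)
        return solve_upper(low, solve_lower(low, self.f))    # B^{-1} f

    def sample(self, v, rng):
        low = cholesky(self.B)
        z = [rng.gauss(0.0, 1.0) for _ in self.f]
        noise = solve_upper(low, z)                          # covariance B^{-1}
        return [m + v * e for m, e in zip(self.mean(), noise)]

    def update(self, b, reward):
        for i, bi in enumerate(b):
            self.f[i] += bi * reward
            for j, bj in enumerate(b):
                self.B[i][j] += bi * bj

def thompson_per_arm(true_params, contexts, v, noise_std, rng):
    arms = [ArmPosterior(len(true_params[0])) for _ in true_params]
    choices = []
    for b in contexts:
        samples = [arm.sample(v, rng) for arm in arms]       # one sample per arm
        i = max(range(len(arms)), key=lambda k: dot(b, samples[k]))
        reward = dot(b, true_params[i]) + rng.gauss(0.0, noise_std)
        arms[i].update(b, reward)                            # only the played arm
        choices.append(i)
    return arms, choices

# Hypothetical usage: two arms in dimension 2 with made-up parameters.
rng = random.Random(0)
contexts = [[rng.random(), rng.random()] for _ in range(500)]
arms, choices = thompson_per_arm([[1.0, 0.0], [0.0, 1.0]], contexts, 0.5, 0.1, rng)
```

Only the played arm is updated in each round, which reflects the difficulty noted above: every arm has its own parameter, but each time step yields information about just one of them.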

7.1  Experiments 

First, we describe the intuition behind the OLS algorithm presented in [6]. The authors prove that the expected regret of this algorithm is in $O(d^2 \log^{3/2} d \cdot \log T)$.

7.1.1  OLS 

In this algorithm, the OLS estimator is used to find the estimated parameter for each arm i. Let $X_i \in \mathbb{R}^{n \times d}$ be the contexts of the samples in which arm i has been played and $Y_i \in \mathbb{R}^n$ be the rewards of those samples. Then the estimated parameter $\hat{\mu}(X_i, Y_i)$ is defined as follows:

$$\hat{\mu}(X_i, Y_i) = (X_i^T X_i)^{-1} X_i^T Y_i$$
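For illustration, the estimator can be computed by solving the normal equations $(X_i^T X_i)\hat{\mu} = X_i^T Y_i$ directly. The function name and the small data set below are made up for the example, and the sketch assumes that $X_i^T X_i$ is invertible.

```python
def ols_estimate(X, Y):
    """Return (X^T X)^{-1} X^T Y for a list of context rows X and rewards Y."""
    d = len(X[0])
    # Build the normal equations A mu = rhs with A = X^T X and rhs = X^T Y.
    A = [[sum(row[i] * row[j] for row in X) for j in range(d)] for i in range(d)]
    rhs = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(d)]
    # Gaussian elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular")
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, d):
            factor = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= factor * A[col][c]
            rhs[r] -= factor * rhs[col]
    # Back substitution.
    mu = [0.0] * d
    for i in reversed(range(d)):
        mu[i] = (rhs[i] - sum(A[i][j] * mu[j] for j in range(i + 1, d))) / A[i][i]
    return mu

# Hypothetical noiseless samples generated by the parameter (2, 3).
mu_hat = ols_estimate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [2.0, 3.0, 5.0])
```

On these noiseless samples the estimate recovers the generating parameter, since the rewards are exactly linear in the contexts.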

They use two kinds of samples for each arm: forced samples and all samples. The forced samples of each arm are determined by the algorithm upfront (before playing any arm), and the all-samples set contains all the time steps at which that arm has been played so far. The algorithm uses the forced samples of each arm to choose a subset of arms which have a "high enough" estimated reward at this time step (the reward is obtained by using the OLS estimator of the forced samples of each arm and the context of this step). Then, it uses all samples of each arm and plays the arm in that subset with the highest estimated reward at this time step.

7.1.2  Results 


•  Simulation  1:  The  dimension  of  the  context  vector  at  each  time  step  is  6  and  each  of  the 
entries  at  each  time  step  is  generated  uniformly  at  random  (the  entries  are  independent  of 
each  other). 

There are 6 different arms in this simulation, and the parameter of each arm is generated uniformly at random at the beginning of the simulation. We run the simulation 10 times, and the following plot shows the average regret of the different methods for different time horizons.

Figure 6: Simulation 1

•  Simulation  2:  The  dimension  of  the  context  vector  at  each  time  step  is  6  and  each  of  the 
entries  at  each  time  step  is  generated  uniformly  at  random  (the  entries  are  independent  of 
each  other). 

There are 6 different arms in this simulation, and the parameters of the arms are as follows: $\mu_0 = \{1, 1, -1, -1, 0, 0\}$; $\mu_1 = \{1, -1, 1, -1, 0, 0\}$; $\mu_2 = \{-1, -1, -1, 0, 1, 1\}$; $\mu_3 = \{1, -1, 0, 0, -1, 1\}$; $\mu_4 = \{0, 0, -1, -1, 1, 1\}$; $\mu_5 = \{0.1, 0.1, 0.1, 0.1, 0.1, 0.1\}$.

We run the simulation 10 times, and the following plot shows the average regret of the different methods for different time horizons.

Figure 7: Simulation 2

•  Simulation 3: The dimension of the context vector at each time step is 6. Each entry is generated independently from a normal distribution at each time step. The mean of the normal distribution of each dimension is generated uniformly at random at the beginning. The variance of the distribution is also generated uniformly at random (between 0 and 0.5).

There are 6 different arms in this simulation, and the parameters of the arms are as follows: $\mu_0 = \{1, 1, -1, -1, 0, 0\}$; $\mu_1 = \{1, -1, 1, -1, 0, 0\}$; $\mu_2 = \{-1, -1, -1, 0, 1, 1\}$; $\mu_3 = \{1, -1, 0, 0, -1, 1\}$; $\mu_4 = \{0, 0, -1, -1, 1, 1\}$; $\mu_5 = \{0.1, 0.1, 0.1, 0.1, 0.1, 0.1\}$.

We run the simulation 10 times, and the following plot shows the average regret of the different methods for different time horizons.

(Plot legend: OLS, Thompson, best single)

Figure  8:  Simulation  3 


•  Simulation 4: The dimension of the context vector at each time step is 6. Each entry is generated independently from a normal distribution at each time step. The mean of the normal distribution of each dimension is generated uniformly at random at the beginning. The variance of the distribution is also generated uniformly at random (between 0 and 0.5).

There are 6 different arms in this simulation, and the parameter of each arm is generated uniformly at random at the beginning of the simulation. We run the simulation 10 times, and the following plot shows the average regret of the different methods for different time horizons.

(Plot legend: OLS, Thompson, best single)

Figure  9:  Simulation  4 


8  Open  Problems 

We have introduced the leaky bucket input model. The definition is as follows: for each time t and each arm i with parameters $(r_i, a_i)$ and $(c_i, b_i)$, the total reward and cost of that arm in the first t time steps should be in $[r_i t - a_i, r_i t]$ and $[c_i t, c_i t + b_i]$, respectively. We proposed a simple policy for the case that $a_i$ and $b_i$ are constant values. Extending this policy to the case that $a_i$ and $b_i$ are functions of time is an open question. In this case, it seems that we need more samples for the arms that we sample later. It is obvious that the problem cannot be solved for arbitrary functions, because it then includes the fully adversarial setting. Characterizing the functions for which a policy can compete against the single best arm in this model is a possible future line of work.

Another important future line of work is the Markovian input model. We provided the Recency algorithm and compared it experimentally with UCB-Simplex. Providing a theoretical analysis for this algorithm, or for another algorithm, showing that the regret is guaranteed to be low is an important future direction for this problem. Comparing the two algorithms on real data from a changing environment might also help to decide which of them works better in real situations.

Abstract

While PAC algorithms for reinforcement learning provide strong theoretical guarantees, their high sample complexity has thus far prevented their use in practical applications. Transfer learning has the potential to reduce the sample complexity of learning by reusing samples across the state-action-MDP space, yet few PAC algorithms exist that take advantage of transfer learning. We introduce a unified approach to inter- and intra-task knowledge transfer with PAC guarantees which, under appropriate conditions, leads to a sample complexity reduction that is exponential with respect to the dimensionality of the state-action-MDP space. We show that approximate local linearity is sufficient (but not necessary) for our algorithm to yield dramatic sample complexity reductions over the current state of the art in PAC reinforcement learning. In addition, we show that, in the batch-mode learning setting, pessimism in the face of uncertainty can offer significant advantages over optimistic or maximum-likelihood estimates.


1  Introduction

Reinforcement learning (RL) is suffering from a disconnect between theory and practice. Algorithms with strong theoretical guarantees do not perform well in practice [4], while the algorithms that perform best in practice offer no theoretical guarantees (in fact, counter-examples exist for many of these algorithms, showing that they can have very poor worst-case performance [13]).

Probably approximately correct (PAC) algorithms for RL are a class of algorithms that has recently garnered significant attention. They offer a high-probability guarantee (probably) that they will produce a policy that performs almost as well (approximately) as the optimal policy (correct), while requiring a number of samples that is at most polynomial in the parameters of the problem. The polynomial guarantee is with respect to 1) the size of the state-action-MDP space, 2) the planning horizon, 3) how far the approximate policy can deviate from the optimal policy, and 4) the probability that the algorithm will fail to produce an acceptable policy given a sufficient number of samples.

Despite the significant step forward that the polynomial sample guarantees of PAC-RL algorithms represent when compared to the guarantees provided by their predecessors (asymptotic guarantees where convergence is only guaranteed in the limit of infinite samples), their sample requirements have proven to be too steep for practical applications. Lower bound results [22, 10, 1] indicate that providing PAC guarantees is inherently sample-intensive in the worst case. Recent research has focused on identifying classes of "well-behaved" problems, for which far fewer samples suffice to provide PAC guarantees [19].

The size of the state-action-MDP space is by far the largest contributor to the sample complexity
of RL. As the number of state, action, and MDP variables increases, the size of the space can grow
exponentially with the number of dimensions. The problem is even more pronounced in the case of
continuous spaces. Inter- and intra-task knowledge transfer aims to reduce the effective size of the
state-action-MDP space by reusing samples across state-actions in separate MDPs or within the same
MDP, respectively.

To the best of our knowledge, only one algorithm exists that can take advantage of intra-task
knowledge transfer and maintain PAC guarantees under appropriate conditions [2]. Similarly, only
one algorithm is known to exist that addresses inter-task transfer with PAC guarantees [18]. We
present a general transfer learning algorithm for which the two previously known results are special
cases. This new algorithm can take advantage of both inter- and intra-task knowledge transfer without
having to differentiate between the two, and is able to maintain PAC guarantees under more general
conditions than its predecessors. An approximate local linearity assumption on the dynamics of the
process is a sufficient (but not necessary) condition for our algorithm to yield sample complexity
reductions that are exponential with respect to the dimensionality of the state-action-MDP space.

For simplicity of exposition we will show how our approach to transfer learning can be used to
reduce sample complexity in the batch mode learning setting (algorithms in this setting learn from
a fixed set of samples and perform no learning during the policy execution or testing phase). If
we combine our approach with recent PAC exploration algorithms [18, 14], the sample complexity
reduction in the exploration setting is identical to the improvement in the batch mode learning setting.
Additionally, we will assume that the task variation is observable. Our algorithm can be used with
minor modifications as a subroutine to improve the sample complexity of recent task identification
algorithms [3, 14].

Optimism in the face of uncertainty has become the norm for exploration algorithms, and maximum
likelihood estimation has long been the norm for batch-mode learning algorithms. By contrast, our
algorithm employs pessimism in the face of uncertainty. We show that in the context of batch-mode
learning, pessimism in the face of uncertainty can offer significant advantages over optimistic or
maximum-likelihood estimates.


2 Background, notation and definitions

In the following, important symbols and terms will appear in bold when first introduced. Let $X$ be the
domain of $x$. Throughout this paper, $\forall x$ will serve as a shorthand for $\forall x \in X$. We will use
$x$, $\bar{x}$, $\tilde{x}$, $x'$ to denote time-indexed state-action-MDP triples, where $x_h$, $x_s$, $x_a$, $x_m$
denote the time-step, state, action, and MDP components of $x$.

A Markov decision process (MDP) family is a 7-tuple $(S, A, M, P, R, \gamma, H)$, where $S$ is the state
space of the process, $A$ is the action space, $M$ is a family of MDPs, $P$ is a Markovian transition model
($p(x' \mid x)$ denotes the probability density of a transition to state $x'$ when starting from
state-action-MDP triple $x$ at time $x_h$), $R$ is a reward function ($R(x, x')$ is the reward for
transitioning to state $x'$ from $x$), $\gamma \in [0, 1]$ is a discount factor for future rewards, and $H$ is a
horizon time for each episode. In this paper we will be considering time-dependent transition and reward
models. A deterministic policy $\pi$ for an MDP is a mapping $\pi : S, h \mapsto A$ from states and time-steps
to actions; $\pi(x_s, h)$ denotes the action choice in state $x_s$ at step $h$. The value $Q^{\pi}(x)$ of a
state-action-MDP triple $x$ at step $x_h$ under policy $\pi$ is defined as the expected, accumulated,
discounted reward from $x$ at step $h$, when all decisions starting from step $h + 1$ are made according to
policy $\pi$. There exists an optimal policy $\pi^*$ for choosing actions which yields the optimal value
function $Q^*(x) = \max_{\pi} Q^{\pi}(x)$. For a fixed policy $\pi$ the Bellman operator for $Q$ is defined as:
$B^{\pi} Q(x) = \int_{x'} p(x' \mid x, \pi) \, (R(x, x') + \gamma Q(x'))$.
The value of a state $s$ at step $h$ under policy $\pi$ in MDP $m$ is defined as the expected, accumulated,
discounted reward from $s$ in MDP $m$ at step $h$, when all decisions starting from step $h$ are made
according to policy $\pi$: $V^{\pi}(s, m, h) = \int_{x} p(x \mid x_s = s, x_m = m, x_h = h, \pi) \, Q^{\pi}(x)$. We also
define $V^*(s, m, h) = \max_{\pi} V^{\pi}(s, m, h)$.

In reinforcement learning [23], a learner interacts with a stochastic process modeled as an MDP and
observes the state at every step; however, the transition model $P$ is not known. The goal is to learn a
near-optimal policy using experience collected through interaction with the process. At each step
of interaction, the learner observes the current state $x_s$ and MDP $x_m$, chooses an action $x_a$, and
observes the resulting next state $x'$, essentially sampling the transition model of the process. Thus
experience comes in the form of $(x, x')$ samples.

For simplicity of exposition we will assume that the reward function is known. Our results easily
generalize to the unknown reward setting. We will also assume that all rewards are bounded, and
without loss of generality that they lie in $[0, R_{max}]$.¹ We will use $Q_{max}$ to denote $\max_{x} Q^*(x)$
(the maximum expected discounted reward for any policy from any state-action-MDP triple at any
timestep), and define $\hat{Q}_{max} = R_{max} + \gamma Q_{max}$.

Definition 2.1. The sample set $S$ is defined to be the set of all samples retained by our algorithm.

Definition 2.2. $d(x, \bar{x})$ is defined to be the distance of $(x_s, x_a, x_m, x_h)$ to
$(\bar{x}_s, \bar{x}_a, \bar{x}_m, \bar{x}_h)$ in some well-defined distance metric. We also define the shorthand
$d(x, \bar{x}, c) = \max\{0, d(x, \bar{x}) - c\}$ where $c$ is a constant, $d_k(x, S)$ the distance of the $k$-th
nearest sample in $S$ to $x$, and $d_k(x, S, c) = \max\{0, d_k(x, S) - c\}$. When an insufficient number of
samples exists to compute the last two functions, they return infinity.
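
As an illustration of Definition 2.2, the helper functions can be sketched in Python. This is a
minimal sketch rather than the authors' code: the Euclidean metric, the tuple representation of
points, and all function names are assumptions made for the example.

```python
import math

def euclidean(x, y):
    """Assumed distance metric; any well-defined metric could be substituted."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def clipped_distance(x, y, c, dist=euclidean):
    """d(x, y, c) = max{0, d(x, y) - c}."""
    return max(0.0, dist(x, y) - c)

def kth_nearest_distance(x, samples, k, dist=euclidean):
    """d_k(x, S): distance from x to its k-th nearest sample, or infinity
    when fewer than k samples are available."""
    if len(samples) < k:
        return math.inf
    return sorted(dist(x, s) for s in samples)[k - 1]

def clipped_kth_nearest_distance(x, samples, k, c, dist=euclidean):
    """d_k(x, S, c) = max{0, d_k(x, S) - c}; stays infinite when d_k is infinite."""
    return max(0.0, kth_nearest_distance(x, samples, k, dist) - c)
```

Passing a different `dist` callable swaps in another metric, for example a weighted norm.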

Examples of distance metrics include weighted norms, norms on linear and non-linear transformations,
as well as norms on features of the domain space. In the following we will use $\tilde{Q}$ to denote the
approximate value function computed by our algorithm, and $\pi^{\tilde{Q}}$ the greedy policy over $\tilde{Q}$, followed
by our algorithm.

3 A motivating example

We will use an autonomous race-car agent as an example of how samples (knowledge) can be shared
across different parts of the space, both within the same task (intra-task transfer), and among different
tasks (inter-task transfer). The agent can be presented with multiple variations of a race car problem:
1) different cars (four wheel drive, front wheel drive, rear wheel drive), 2) different tracks (gravel,
dirt, clay, asphalt), and 3) different driving conditions (sun, rain, snow).

Consider the effects of applying the brakes for 0.3 seconds when traveling at 100 km/h. Suppose that
the speed of the vehicle after applying the brakes is 80 km/h. It is reasonable to assume that if we
apply the brakes at similar speeds, say 99.5 and 100.5 km/h, the effects will be similar. How different
the starting state can be before the similarity assumption starts to break down strongly depends on
how we define the effects of an action. In the simplest case, the effect of an action at a particular state
is the state that results from taking that action. In other words, we are saying that the car's speed
after applying the brakes when starting at 99.5, 100, and 100.5 km/h should be about 80 km/h. One
obvious way to make this assumption true over a larger range of values is to think of an action as
a transformation. Instead of saying that the effect of applying the brakes at 100 km/h is a resulting
speed of 80 km/h, we can say that the effect of applying the brakes at 100 km/h is a reduction in
speed by 20 km/h. The range of values for which this transformation will be sufficiently accurate
will depend on how closely this system resembles a linear system. For a perfectly linear system the
transformation will be accurate over the entire range of values. Note that transformations are not
required to be linear functions. For example, one could define a saturating transformation: applying
the brakes when starting from positive speed cannot result in negative speed.

Previous work on inter-task transfer has focused on transferring value functions [18], policies [24], or
complete sample sets [3, 14]. While all the above strategies can be effective in certain situations, there
exist very simple examples where they fail. Consider the case where the autonomous car agent has
to complete races on the same track on three different days. On the first day the weather is mild. On
the second the track is covered with ice. On the third, the entire track is clear of ice except for a
particular turn. While, up to the point of the ice-covered turn, the dynamics of the third task are
identical to the dynamics of the first, entering the turn at the same speed as on the first day might
cause the car to lose control. Neither the value function nor the policy of the first task is
transferable to the third, even on segments of the track where these two tasks are identical. The value
function and policy of the second task are similarly problematic since, even in the best case, they
are too conservative for the third task, and, in the worst case, they are using drifting techniques that
will not work on clear segments of the track. While the problem is trivially solvable by transferring
samples from the first task for clear segments and samples from the second task for the icy segment,
even PAC transfer learning methods that directly transfer samples do not have this flexibility. Existing
methods only allow using all samples gathered from a particular MDP (or cluster of MDPs) or none
at all (a notable exception in the non-PAC setting is the work of Lazaric et al. [11]).

¹ It is easy to satisfy this assumption in all MDPs with bounded rewards by simply shifting the reward space.

4 Inter and intra-task knowledge transfer

Both inter- and intra-task transfer can be performed via a transfer function:²

Definition 4.1. A transfer function $f_{tr}(\bar{x}, \bar{x}', x) \to x'_t$ is a function that takes as input a source
state-action-MDP triple, an associated next state, and a target state-action-MDP triple, and outputs
a predicted next state for the target state-action-MDP triple.

Two straightforward examples of transfer functions are 1) the identity transfer function
$f_{tr}(\bar{x}, \bar{x}', x) \to \bar{x}'$ and 2) the relative transfer function
$f_{tr}(\bar{x}, \bar{x}', x) \to \bar{x}' + x_s - \bar{x}_s$. Significantly more complex, non-linear,
domain-specific transfer functions can be defined, including
transfer functions that translate functions between MDPs with different state and action spaces. We
are using the identity and relative transfer functions as examples because they are simple and intuitive,
and as we will see in the related work section, much of existing work in inter- and intra-task transfer
learning has implicitly or explicitly used these two transfer functions.
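
The two transfer functions can be sketched in a few lines of Python, using the braking example from
section 3. This is only an illustration under simplifying assumptions: states are tuples of floats, the
action and MDP components of each triple are omitted, and the function names (including the
saturating variant) are hypothetical.

```python
def identity_transfer(src_state, src_next, tgt_state):
    """Predict that the target ends up exactly where the source did."""
    return tuple(src_next)

def relative_transfer(src_state, src_next, tgt_state):
    """Predict that the target undergoes the same displacement as the source."""
    return tuple(t + (n - s) for s, n, t in zip(src_state, src_next, tgt_state))

def saturating_relative_transfer(src_state, src_next, tgt_state, floor=0.0):
    """Relative transfer that never predicts a value below `floor`."""
    shifted = relative_transfer(src_state, src_next, tgt_state)
    return tuple(max(floor, v) for v in shifted)

# One observed sample: braking at 100 km/h leaves the car at 80 km/h.
source, source_next = (100.0,), (80.0,)

print(identity_transfer(source, source_next, (99.5,)))             # (80.0,)
print(relative_transfer(source, source_next, (99.5,)))             # (79.5,)
print(saturating_relative_transfer(source, source_next, (10.0,)))  # (0.0,)
```

The identity function reuses the observed outcome, the relative function reuses the observed change of
20 km/h, and the saturating variant additionally refuses to predict a negative speed.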

4.1 The transfer learning algorithm

Definition 4.2. The cover set $C(S, m)$ is defined to be the (possibly infinite) set of $x : x_m = m$ for
which $\min_{\bar{x} \in S} \{d(x, \bar{x})\} < Q_{max}$.

The cover set is the portion of the state-action space for which at least one sample exists less than
$Q_{max}$ away.

Definition 4.3. The discretization set $D(S, m, d_Q)$ is defined as a discretization of the cover set
$C(S, m)$, such that for every $x \in C(S, m)$ there exists $\tilde{x} \in D(S, m, d_Q)$ such that $d(x, \tilde{x}) \leq d_Q$.
We also define the shorthand $d(x, D(S, m, d_Q)) = \min_{\tilde{x} \in D(S, m, d_Q)} \{d(x, \tilde{x})\}$.

We will use $|D(S, m, d_Q)|$ to denote the number of elements in $D(S, m, d_Q)$ for which at least one
sample in $S$ exists less than $Q_{max}$ away. As we will see in our analysis, our algorithm's computational
complexity depends on $|D(S, m, d_Q)|$ (which is finite), even if the cardinality of the state-action
space is infinite. The best way to construct $D(S, m, d_Q)$ depends on the application. The simplest
implementation is to find the extreme values for every state and action variable in $C(S, m)$ (this can
be easily performed given $S$ and $m$) and allocate a dense multidimensional array covering all values
within that space with appropriate step sizes to make sure that the radius of each element is less than
or equal to $d_Q$. The advantage of this structure is that it offers $O(1)$ lookups. The disadvantage is
that if the volume covered by $C(S, m)$ is significantly smaller than the volume contained within the
extreme values, the dense scheme could lead to inefficient memory usage. A sparse representation
(see for example the representation scheme employed by Pazis and Parr [18]) is space efficient, but
can have significantly worse lookup performance ($O(|D(S, m, d_Q)|)$ for a naive implementation).
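
The dense scheme described above can be sketched as follows. This is one possible reading, not the
authors' implementation: it assumes a Euclidean distance, axis-aligned cubic cells, and a hypothetical
class name `DenseGrid`. The step size is chosen so that the half-diagonal of each cell equals $d_Q$.

```python
import math

class DenseGrid:
    """Dense axis-aligned grid whose cells have Euclidean radius at most d_q."""

    def __init__(self, lows, highs, d_q):
        n = len(lows)
        self.lows = list(lows)
        # A cube with side `step` has half-diagonal step * sqrt(n) / 2 = d_q.
        self.step = 2.0 * d_q / math.sqrt(n)
        self.shape = [max(1, math.ceil((hi - lo) / self.step))
                      for lo, hi in zip(lows, highs)]

    def index(self, x):
        """Map a point to the index of its cell without searching the grid."""
        idx = []
        for v, lo, size in zip(x, self.lows, self.shape):
            i = int((v - lo) // self.step)
            idx.append(min(max(i, 0), size - 1))  # clamp points on the boundary
        return tuple(idx)

    def center(self, idx):
        """Representative point (cell center) for a cell index."""
        return tuple(lo + (i + 0.5) * self.step
                     for lo, i in zip(self.lows, idx))

grid = DenseGrid(lows=[0.0], highs=[10.0], d_q=0.5)
cell = grid.index((3.7,))
print(cell, grid.center(cell))  # (3,) (3.5,)
```

The lookup cost is independent of the number of cells, but the array covers the whole bounding box,
which is the memory trade-off noted in the text.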

Given a set of samples and target MDP $m$, algorithm 1 generates a discretization set (line 2) and
uses it to compute a pessimistic approximate value function (lines 3-7). It then follows the greedy
policy over the computed value function (lines 8-10). Although algorithm 1 is tailored to the episodic
setting, it can be easily extended to the infinite horizon setting.

4.2 Representation and transfer bias

Definition 4.4. Given a distance function $d()$, transfer function $f_{tr}$, and constants $d_Q$ and $d_{tr}$,
$\epsilon_c \geq 0$ is the minimal non-negative constant satisfying
$|B^{\pi} \tilde{Q}(x) - B^{\pi} \tilde{Q}(\tilde{x})| \leq \epsilon_c + d(x, \tilde{x}, d_Q)$,
$\forall (x, \tilde{x}) : x_m = \tilde{x}_m$, and for $\pi \in \{\pi^*, \pi^{\tilde{Q}}\}$, and

$\left| \int_{x'} p(x' \mid x, \pi) \, (R(x, x') + \gamma \tilde{Q}(x')) - \int_{x'_t} p(x'_t = f_{tr}(\bar{x}, \bar{x}', x)) \, (R(x, x'_t) + \gamma \tilde{Q}(x'_t)) \right| \leq \epsilon_c + d(x, \bar{x}, d_{tr})$,
$\forall (x, \bar{x})$, $\pi \in \{\pi^*, \pi^{\tilde{Q}}\}$.

² This definition of a transfer function is specific to our algorithm, and is not meant to subsume similar
concepts such as inter-task mappings [25].

Algorithm 1 PAC Transfer

 1: Inputs: MDP $m$, horizon $H$, sample set $S$, number of neighbors $k$, transfer function $f_{tr}$,
    distance function $d()$, and constant $d_Q$.
 2: Generate discretization set $D(S, m, d_Q)$, and set $\tilde{Q}(x) \leftarrow 0$ $\forall x \in D(S, m, d_Q)$.
 3: for $h = H - 1$ to $0$ do
 4:   for $x \in D(S, m, d_Q)$ do
 5:     $\tilde{Q}(x) = \frac{1}{k} \sum_{i=1}^{k} \max\{0, \max_{a'} \{R(x, x'_{i,a'}) + \gamma \tilde{Q}(x'_{i,a'}) - d(x, x_i)\}\}$,
        where $x'_{i,a'}$ has action component $a'$, MDP component $m$, and state component
        $f_{tr}(x_i, x'_i, x)$, and $(x_i, x'_i)$ is the $i$-th sample returned by $nn_k(x, S)$.
 6:   end for
 7: end for
 8: for $h = 0$ to $H - 1$ do
 9:   Perform action $\arg\max_{x_a} \tilde{Q}(x)$, where $x_s$ is the state at step $h$ and $x_0$ is the starting state.
10: end for
11: function $nn_k(x, S)$
12:   return the $k$ nearest samples to $x$ in $S$.
13: end function
14: function $\tilde{Q}(x)$
15:   if $d(x, D(S, m, d_Q)) > d_Q$ then
16:     return 0.
17:   else
18:     return $\tilde{Q}(\arg\min_{\tilde{x} \in D(S, m, d_Q)} d(x, \tilde{x}))$, where ties are resolved by a deterministic function.
19:   end if
20: end function
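
To make the structure of the backup in lines 3-7 concrete, the following Python sketch runs a
pessimistic backward induction on a toy problem. It is a simplified illustration, not a faithful
implementation of Algorithm 1: it assumes scalar states, a single MDP, time-independent samples, a
distance that is infinite between different actions, the relative transfer function, and a plain list of
grid states in place of the discretization set. All names are hypothetical.

```python
def pac_transfer_q(samples, grid, actions, reward, horizon, k, d_q, gamma=1.0):
    """Return q, where q[h][(i, a)] is the pessimistic value of grid[i] and action a."""

    def lookup(q_h, s):
        # Best action value at the nearest grid point; 0 if s is not covered.
        i = min(range(len(grid)), key=lambda j: abs(grid[j] - s))
        if abs(grid[i] - s) > d_q:
            return 0.0
        return max(q_h[(i, a)] for a in actions)

    q = [dict() for _ in range(horizon + 1)]
    for i in range(len(grid)):
        for a in actions:
            q[horizon][(i, a)] = 0.0  # nothing left to collect after the last step

    for h in range(horizon - 1, -1, -1):
        for i, s in enumerate(grid):
            for a in actions:
                # k nearest samples that used the same action, as (distance, s0, s1)
                near = sorted((abs(s - s0), s0, s1)
                              for s0, b, s1 in samples if b == a)[:k]
                if len(near) < k:
                    q[h][(i, a)] = 0.0  # too little evidence: stay pessimistic
                    continue
                total = 0.0
                for d, s0, s1 in near:
                    pred = s + (s1 - s0)  # relative transfer function
                    backup = reward(s, a, pred) + gamma * lookup(q[h + 1], pred) - d
                    total += max(0.0, backup)  # distance penalty, clipped at zero
                q[h][(i, a)] = total / k
    return q

def greedy_action(q_h, grid, actions, s):
    """Greedy action at state s using the nearest grid point."""
    i = min(range(len(grid)), key=lambda j: abs(grid[j] - s))
    return max(actions, key=lambda a: q_h[(i, a)])

# Toy chain: moving right (+1) from states 0..3 was sampled, moving left (-1) only from 4.
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
actions = [-1, 1]
samples = [(0.0, 1, 1.0), (1.0, 1, 2.0), (2.0, 1, 3.0), (3.0, 1, 4.0), (4.0, -1, 3.0)]
goal_reward = lambda s, a, s_next: 1.0 if s_next >= 4 else 0.0
q = pac_transfer_q(samples, grid, actions, goal_reward, horizon=2, k=1, d_q=0.5)
```

In this toy run, state-actions far from any sample are assigned value 0 because the distance penalty
cancels the backed-up value, so the greedy policy only relies on well-covered parts of the space.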


$\epsilon_c$ expresses the bias of algorithm 1. The first source of bias stems from the fact that algorithm 1
(or any other algorithm with finite computational requirements) can only represent functions of
finite complexity. The slower the Bellman operator changes with respect to the chosen distance
function, the lower the bias due to the fact that algorithm 1 is using a finite set of points to represent a
continuous function. This is captured in the first part of definition 4.4.

The second source of bias is transfer. Transfer learning allows us to reduce sample complexity
(variance) at the expense of introducing bias. When the transfer function is able to accurately model
the dynamics in remote parts of the state-action-MDP space based on samples from other areas of the
state-action-MDP space, transfer bias will be low. Conversely, when the transfer function is unable to
accurately model the dynamics in one part of the state-action-MDP space based on samples from
other areas of the state-action-MDP space, transfer bias will be high. This is captured in the second
part of definition 4.4. The first integral is the (exact) Bellman operator for $x$ applied to the value
function produced by our algorithm, while the second integral is the Bellman operator for another
state-action-MDP triple $\bar{x}$, but with the next state replaced by the value returned by the transfer
function from $\bar{x}$ to $x$.

When the bias introduced by transfer learning is unacceptably high, we have what is commonly
referred to as negative transfer. Rather than thinking of negative transfer as a binary phenomenon
(present/not present), definition 4.4 allows us to quantify its contribution. Unless the chosen transfer
function is able to describe the changes in the dynamics across the state-action-MDP space perfectly
(as would happen for example if the relative transfer function was used for transfer in a linear system),
transfer learning will introduce some bias. Similarly, if the state-action-MDP space is continuous,
using a finite set of points to represent the value function will introduce some bias. Given a fixed
distance and transfer function, the extent of the bias can be managed by adjusting $d_{tr}$ and $d_Q$.

Note that $d_Q$ and $d_{tr}$ are user-defined constants (in other words the user decides the computational
and sample complexity of the algorithm). Setting $d_{tr}$ and $d_Q$ needs to take into account policy
performance as well as sample and computational complexity. As we will see in our analysis, the
sample complexity of our algorithm is strongly dependent on $d_{tr}$, while the computational complexity
is dependent on $d_Q$. The larger $d_{tr}$ and $d_Q$ are, the lower our algorithm's sample and computational
complexity respectively. On the other hand, the performance of the policy produced by our algorithm
is adversely affected by the bias $\epsilon_c$. Definition 4.4 suggests that $d_{tr}$ and $d_Q$ should be set such that
the two sources of bias are balanced.

While knowing the bias $\epsilon_c$ is not necessary for the execution of algorithm 1, our performance bounds
depend on $\epsilon_c$. A good choice of distance function will allow $d_Q$ to be large without forcing $\epsilon_c$ to be
large, allowing for good policy performance and low computational complexity. A good choice of
transfer function will allow $d_{tr}$ to be large without forcing $\epsilon_c$ to be large, allowing for good policy
performance and low sample complexity.


4.3 PAC guarantees

Definition 4.5. A policy cover is defined as a possibly infinite set of state-action pairs
in MDP $m$ such that the expected number of state-actions outside the set encountered in $H$ steps
when starting from state $s$ and following policy $\pi$ is bounded above by [bound illegible in source].

Theorem 4.6 below is the main theorem of this paper. Given a sample set of arbitrary distribution, it
gives us a guarantee that as long as there exists a near-optimal policy for MDP $m$ covered sufficiently
well by the sample set, algorithm 1 will perform well in $m$ with high probability. The samples can be
samples collected from $m$, from MDPs similar to $m$, or any combination of the above.

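Theorem 4.6. Let $s$ be the starting state for MDP $m$, and $\epsilon_c$ be defined as in definition 4.4.
If $d(x, \bar{x}) \geq Q_{max}$ $\forall x, \bar{x} : x_h \neq \bar{x}_h$, $k \geq$ [expression illegible in source],
and there exists a policy $\pi'$ with a policy cover $P_H^{\pi'}(s, m, \epsilon_v)$ such
that $V^{\pi'}(s, m, 0) \geq V^*(s, m, 0) - \epsilon_v$, and $d_k(x, S) \leq d_{tr}$ $\forall x \in P_H^{\pi'}(s, m, \epsilon_v)$, then
$V^{\pi^{\tilde{Q}}}(s, m, 0) \geq V^*(s, m, 0) - \epsilon_v - \xi(\epsilon_s + 2\epsilon_c)$ with probability at least $1 - \delta$,
where $\pi^{\tilde{Q}}$ is the policy followed by algorithm 1, and $\xi = H$ if $\gamma = 1$ or
$\xi = \min\{H, \frac{1}{1 - \gamma}\}$ otherwise.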

The requirements for theorem 4.6 to hold are much more relaxed than the conditions of similar
bounds [17]. No assumptions are made about the distribution of samples, the entire space for MDP
$m$ is not required to be well covered, and even the true optimal policy does not need to be covered.
The fact that the computation of $\tilde{Q}$ is pessimistic ensures that as long as a policy with acceptable
performance that is well covered exists, algorithm 1 will perform well with high probability. This is a
departure from previous PAC algorithms that require a sufficient number of nearby samples for every
state-action.

A brute force approach to ensuring that algorithm 1 will perform well with high probability on any
MDP in the family of MDPs we are considering is to guarantee that the entire space of the family
is covered (has a sufficient number of samples within $d_{tr}$ distance). Corollary 4.8 follows from
theorem 4.6 and gives an upper bound on how many samples are sufficient in this scenario.

Definition 4.7. The covering number $\mathcal{N}_{SAM}(d_{tr})$ of a state-action-MDP space is the cardinality
of the largest minimal set $C$ of state-action-MDP triples, such that for any $x$ reachable from the
starting state(s) of any $m \in M$, there exists $\bar{x} \in C$ such that $d(x, \bar{x}) \leq d_{tr}$.

Corollary 4.8. Let $\epsilon_c$ be defined as in definition 4.4, $d(x, \bar{x}) \geq Q_{max}$ $\forall x, \bar{x} : x_h \neq \bar{x}_h$, and
$k =$ [expression illegible in source]. Given at least $\mathcal{N}_{SAM}(d_{tr}) \cdot$ [factor illegible in source] samples
uniformly spaced across the state-action-MDP space,
$V^{\pi^{\tilde{Q}}}(s, m, 0) \geq V^*(s, m, 0) - 2\xi(\epsilon_s + 2\epsilon_c)$ with probability
$1 - \delta$ for any $(s, m)$, where $\pi^{\tilde{Q}}$ is the policy followed by algorithm 1, and $\xi = H$ if $\gamma = 1$ or
$\xi = \min\{H, \frac{1}{1 - \gamma}\}$ otherwise.

While the cardinality of the discretization set appears in our bounds, its influence is only logarithmic
(which is exponentially smaller than the log-linear dependence of other PAC bounds). Corollary 4.8
is of interest for two reasons. The first is that it gives an upper bound on how many samples one
would need to generate given a generative model. The second is that the sample complexity of
PAC exploration algorithms scales with the number of uniformly spaced samples required before
good performance can be guaranteed with high probability. Since $\mathcal{N}_{SAM}(d_{tr})$ can be exponentially
smaller in the dimensionality of the space than $\mathcal{N}_{SAM}(d_Q)$, and algorithm 1 scales with $\mathcal{N}_{SAM}(d_{tr})$
rather than $\mathcal{N}_{SAM}(d_Q)$, the same improvement can be achieved in the exploration setting.

5 Discussion

The sample complexity of modern PAC RL algorithms is log-linear with respect to the size of the
space. In continuous or discrete spaces for which a distance metric exists, the effective size of the
space is strongly dependent on the desired quality of approximation. The size of the space is expressed
by the covering number, the number of hyperballs required to completely cover the state-action-MDP
space (see definition 4.7). Better approximation quality requires hyperballs of smaller radius. As the
radius of each hyperball decreases, the covering number can grow exponentially with the dimensionality
of the space. In effect, the sample complexity of PAC algorithms for spaces with a distance metric
grows exponentially with the dimensionality of the space. Algorithm 1 can lead to exponential sample
complexity reduction with respect to the dimensionality of the state-action-MDP space. The majority
of existing PAC RL algorithms make no distinction between the radius of the hyperball where each
sample can be used to compute the value function, and the radius of the hyperballs where the value
function has constant (or smoothly interpolated) value [16, 7, 18, 14]. Theorem 4.6 tells us that as
long as a transfer function is available that is able to accurately model the dynamics in remote parts
of the state-action-MDP space based on samples from other areas of the state-action-MDP space,
the radius of the hyperball where each sample can be (re)used can be much larger than the radius of
constant value hyperballs typically used in PAC algorithms. This can lead to an exponential reduction
in sample complexity with respect to the dimensionality of the state-action-MDP space.
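
A small calculation illustrates the scale of this effect. The sketch below is only an illustration under
stated assumptions: the space is a hypercube, balls are taken in the max-norm (so a ball of radius r is a
cube of side 2r), and the side length and both radii are hypothetical integers in the same units.

```python
def cells_per_axis(side, radius):
    """Number of intervals of width 2 * radius needed to cover a segment of length side."""
    return -(-side // (2 * radius))  # integer ceiling division

def grid_cover_size(side, radius, dim):
    """Number of max-norm balls of the given radius that cover a hypercube."""
    return cells_per_axis(side, radius) ** dim

side, d_q, d_tr = 100, 1, 10  # hypothetical values
for dim in (1, 2, 4, 6):
    fine = grid_cover_size(side, d_q, dim)
    coarse = grid_cover_size(side, d_tr, dim)
    print(dim, fine, coarse, fine // coarse)
```

With these numbers the ratio between the two cover sizes is 10 raised to the dimension, so letting
samples be reused within the larger radius shrinks the number of balls that must be populated with
samples by a factor that is exponential in the dimension.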

5.1 The advantage of pessimism

PAC exploration algorithms compute an estimate of the value function that is optimistic with respect to
uncertainty, while most batch mode algorithms compute a maximum likelihood estimate. Optimistic
and maximum likelihood estimates can result in the overestimation of the value of poorly covered
suboptimal policies. As a result, even if the value of a well performing policy can be accurately
estimated from a sample set, optimistic and maximum likelihood estimates can result in the selection
of a poorly covered suboptimal policy.
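
A toy comparison makes the failure mode concrete. The sketch below is a hypothetical one-step
example, not taken from the paper's experiments: each action has a single sample, the "naive"
estimate simply reuses the nearest sample's return as a stand-in for a maximum likelihood estimate,
and the pessimistic estimate subtracts the distance to that sample and clips at zero, in the spirit of
line 5 of Algorithm 1 with k = 1.

```python
def nearest_sample_estimate(samples, state):
    """Return of the nearest sample, ignoring how far away it is."""
    _, value = min((abs(state - s), v) for s, v in samples)
    return value

def pessimistic_estimate(samples, state):
    """Return of the nearest sample minus its distance, clipped at zero."""
    d, value = min((abs(state - s), v) for s, v in samples)
    return max(0.0, value - d)

# Hypothetical data: (state where the sample was taken, observed return).
samples = {
    "well_covered": [(0.0, 0.7)],    # sampled exactly at the query state
    "poorly_covered": [(0.5, 1.0)],  # a lucky return observed far from the query
}
query = 0.0

naive = max(samples, key=lambda a: nearest_sample_estimate(samples[a], query))
cautious = max(samples, key=lambda a: pessimistic_estimate(samples[a], query))
print(naive, cautious)  # poorly_covered well_covered
```

The naive rule is drawn to the poorly covered action, while the distance penalty leaves the well
covered action with the higher estimate.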

In many real-world situations, such as when samples have been collected from expert demonstrations,
one can have access to samples from a high quality policy that does not cover the entire state-action
space. If the expert acts near-optimally, it is likely that their demonstrations will only visit a subset of
the state-action space. Low value state-actions may never be visited. Note that the "expert" in this
case does not need to be human. For example, we may be interested in taking over control of cooling a
datacenter from classical control algorithms, in order to improve energy efficiency [6]. Given the
safety constraints of such a system it is unlikely that the entire state-action space will be covered
with samples, yet given that the existing policy has acceptable performance, an acceptable policy is
guaranteed to be covered by the sample set. Additionally, in many domains prior knowledge can allow
us to deduce that a significant portion of the state-action space would not be visited by a near-optimal
policy, and thus does not need to be sampled. Pessimism in the face of uncertainty gives algorithm 1
a significant advantage in these situations, by ensuring that the value of poorly covered suboptimal
policies will not be overestimated (thus those suboptimal policies will not be selected). Rather than
requiring the entire state-action space to be well covered with samples, algorithm 1 only requires that
there exists an adequately covered near-optimal policy. To the best of our knowledge, algorithm 1 is
the first algorithm for which the existence of an adequately covered near-optimal policy is sufficient
to guarantee approximately optimal performance with high probability.

5.2 Selecting distance and transfer functions

Similarly to how feature based algorithms require a set of features, kernel based algorithms require
a kernel function, and deep learning algorithms require an appropriate architecture, distance based
algorithms require an appropriate distance function. Considerable progress has been made over the
past few decades, and automatic discovery of distance metrics is a promising area of research [21, 26]
with deep autoencoders [8] achieving impressive results on areas ranging from raw images [9], to
speech and game playing.

While the focus of this paper is on a theoretical analysis of algorithm 1 given a distance and transfer
function, we can find examples in the literature of simple distance and transfer functions that work
very well in practice. Previous work on PAC reinforcement learning that does not use a transfer
function but imposes smoothness constraints on the value function across an entire MDP family [18]
is implicitly using the identity transfer function and $d_{tr} = d_Q$. Brunskill et al. [2] present CORL, a

PAC algorithm for intra-task transfer in continuous state, discrete action MDPs, where transitions can
be described as an offset from the current state plus Gaussian noise. States are classified into types,
with the parameters of the dynamics being the same for all states in the same type. In effect, CORL is
using the relative transfer function, and a distance function that returns 0 for states of the same type
and infinity for states of different types. The authors present experiments with a real robot navigating
a multi-surface environment, and show that a camera can be used to successfully classify the surface
into different types. Since the algorithm by Brunskill et al. arises as a special case of our algorithm,
the experimental results in that paper can be replicated by our algorithm.
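
As a rough illustration of the CORL special case described above, the sketch below pairs a relative transfer function with a type-based distance. It assumes, as the text suggests but does not spell out here, that the relative transfer function reuses the offset observed at the sample (next state minus state) at the target state. The function names and the tuple representation of states are hypothetical.

```python
import math

def relative_transfer(sample_state, sample_next_state, target_state):
    """Assumed form of the relative transfer function: apply the offset
    observed at the sample to the target state."""
    return tuple(t + (n - s)
                 for s, n, t in zip(sample_state, sample_next_state, target_state))

def type_distance(type_a, type_b):
    """Distance that is 0 for states of the same type and infinity otherwise."""
    return 0.0 if type_a == type_b else math.inf
```

With these two pieces, samples from any state of the same type can be reused at the target state, while samples from other types are never selected as neighbors.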

While the local linearity assumption made by the relative transfer function is relatively uncommon in
the reinforcement learning literature, it is a common assumption in control systems. It has been used
in practical applications ranging from temperature and voltage regulation to autopilot systems for
recreational drones and commercial aircraft. Nevertheless, there exist domains for which improving
on the relative transfer function requires only minimal domain specific knowledge, such as
domains with saturating or quantized dynamics.

6  Related work

The work of Brunskill et al. [2] described in the previous section is the only existing PAC algorithm
for intra-task transfer. Our work generalizes that idea to arbitrary transfer functions and noise, as
well as to inter-task transfer. Additionally, algorithm 1 enjoys significantly better dependence on the
covering number and planning horizon.

Mann and Choe [15] introduce weak admissible heuristics, and show how they can be used for
transfer learning. Weak admissible heuristics transfer values rather than samples, which makes
them orthogonal to our approach. It would be straightforward to construct an algorithm that takes
advantage of both a transfer function and a weak admissible heuristic. The concept of a transfer
function is related to inter-task mappings [25], a type of transfer function for inter-domain value
function transfer.

While transfer learning has attracted significant attention from the reinforcement learning
community (see for example the survey by Taylor and Stone [24]), only a handful of algorithms exist
for which theoretical guarantees have been proven. Lazaric and Restelli [12] present three transfer
learning algorithms (AST, BAT, and BTT), and prove performance bounds for AST and BAT. All
three algorithms assume that: 1) There is a number of source tasks and a single target task. 2) We
have access to generative models for the environments. 3) The target task can be expressed accurately
enough as a linear combination of the source tasks. AST uses all samples collected from the source
tasks to learn a value function for the “average MDP”. Using a generative model of the target task,
BAT tries to find the best proportions from which to generate samples from the source tasks. Finally,
BTT, for which no theoretical guarantees are known, allows constraints to be placed on how many
samples can be generated from each task. These algorithms could be useful in a scenario where
evaluating the target task is significantly more computationally expensive than evaluating the source
tasks (as long as the target task can also be approximated accurately enough as a linear combination
of the source tasks).

Brunskill and Li [3], as well as Liu, Guo and Brunskill [14], focus on the identification task for the
discrete and continuous settings respectively. Compared to naively starting exploration from scratch
every time a task is sampled, both algorithms are shown to offer an improvement in the aggregate
sample complexity when run for a finite number of tasks. This improvement is achieved because
finding which cluster a task belongs to requires fewer samples than exploring in that task from scratch.
However, transfer learning is not used to reduce the sample complexity of the learning phase. The
sample complexity of the learning phase is the number of clusters times the sample complexity of
learning in each cluster from scratch. The algorithms presented in those papers could be combined
with the approach presented in this paper, achieving the best of both worlds.

The fact that pessimism in the face of uncertainty can be advantageous has been previously identified
by the robust MDP literature [20]. Our work identified a new advantage of pessimism, and contributed
a more computationally efficient algorithm. While robust MDP algorithms require solving multiple
linear programs at every iteration, algorithm 1 is no more computationally expensive than maximum
likelihood value iteration.
7  Analysis


We will begin our analysis by extending the theory of Bellman error MDPs [18] to the episodic
setting. Bellman error MDPs will allow us to prove bounds that depend on how well some policy $\pi^*$
is covered, rather than requiring all state-action-MDP triples to be covered with samples. Once we
have extended Bellman error MDPs to the episodic setting, we will show that due to the pessimistic
nature of algorithm 1, the Bellman error for the policy followed by the algorithm will not be too
positive for any point in the discretization set. We also show that the Bellman error for some policy
$\pi^*$ will not be too negative for any point in the discretization set. We then extend the proof of the
above to any state-action-MDP triple. Finally, we use these properties to show that as long as certain
conditions are met, algorithm 1 will perform well with high probability.

Let $m$ be an MDP $(S, A, P, R, \gamma, H)$ with Bellman operator $B^{\pi}$ for policy $\pi$, and $Q$ an approximate
value function for $m$. Additionally, let $0 \le Q^{\pi}(x) \le Q_{max}\ \forall x, \pi$. Let the Bellman error
MDP $m_{(\pi,Q)}$ be an MDP which differs from $m$ only in its reward function, which is defined as
$R_{(\pi,Q)}(x) = Q(x) - B^{\pi}Q(x)$ ($Q(x)$ and $B^{\pi}$ are the approximate value function and Bellman
operator of the original MDP $m$). We will use $B^{\pi}_{(\pi,Q)}$ to denote the Bellman operator of $m_{(\pi,Q)}$ under
policy $\pi$, and $Q^{\pi}_{(\pi,Q)}$ to denote its value function.

Most results on Bellman error MDPs follow directly from theorem 7.1 and basic properties of MDPs.

Theorem 7.1. The return of $\pi$ over $m$ is equal to $Q$ minus the return of $\pi$ over $m_{(\pi,Q)}$ ³:

$$Q^{\pi}(x) = Q(x) - Q^{\pi}_{(\pi,Q)}(x)\ \ \forall (x).$$

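Proof. We will use induction to prove our claim, starting with $x_h = H - 1$ as our base case. From
the definition of the Bellman error MDP we have that:

$$Q^{\pi}_{(\pi,Q)}(x \mid x_h = H-1) = Q(x \mid x_h = H-1) - B^{\pi}Q(x \mid x_h = H-1)$$
$$= Q(x \mid x_h = H-1) - Q^{\pi}(x \mid x_h = H-1).$$

Let $Q^{\pi}(x, h) = Q(x, h) - Q^{\pi}_{(\pi,Q)}(x, h)\ \forall x$ hold for some $x_h \ge 1$. We will prove that this implies it
also holds for $x'_h = x_h - 1$:

$$Q^{\pi}_{(\pi,Q)}(x \mid x_h = h-1)$$
$$= Q(x \mid x_h = h-1) - B^{\pi}Q(x \mid x_h = h-1) + \gamma \int_{x'} P(x' \mid x, \pi)\, Q^{\pi}_{(\pi,Q)}(x' \mid x'_h = h)$$
$$= Q(x \mid x_h = h-1) - \int_{x'} P(x' \mid x, \pi) \left( R(x, x'_s) + \gamma Q(x' \mid x'_h = h) \right)$$
$$\quad + \gamma \int_{x'} P(x' \mid x, \pi) \left( Q(x' \mid x'_h = h) - Q^{\pi}(x' \mid x'_h = h) \right)$$
$$= Q(x \mid x_h = h-1) - \int_{x'} P(x' \mid x, \pi) \left( R(x, x'_s) + \gamma Q^{\pi}(x' \mid x'_h = h) \right)$$
$$= Q(x \mid x_h = h-1) - Q^{\pi}(x \mid x_h = h-1).$$

□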
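
The identity in Theorem 7.1 can be checked numerically on a small tabular problem. The sketch below is only an illustration under stated assumptions: it builds a random finite-horizon MDP with a few discrete states and actions (the paper's setting is more general), picks an arbitrary deterministic policy and an arbitrary approximate value function Q with Q = 0 at step H, and compares Q^π with Q minus the value of π in the Bellman error MDP. All names are hypothetical.

```python
import random

def bellman_error_gap(n_states=3, n_actions=2, horizon=4, gamma=0.9, seed=0):
    """Return max |Q^pi - (Q - Q^pi_(pi,Q))| over all (state, action, step)."""
    rng = random.Random(seed)
    S, A, H = range(n_states), range(n_actions), horizon
    P = [[[rng.random() for _ in S] for _ in A] for _ in S]        # P[s][a][s']
    P = [[[p / sum(row) for p in row] for row in P[s]] for s in S]  # normalize rows
    R = [[[rng.random() for _ in S] for _ in A] for _ in S]        # R[s][a][s']
    pi = [[rng.randrange(n_actions) for _ in S] for _ in range(H + 1)]  # pi[h][s]
    zero = [[0.0] * n_actions for _ in S]
    # Arbitrary approximate value function Q[h][s][a], with Q[H] = 0.
    Q = [[[rng.random() for _ in A] for _ in S] for _ in range(H)] + [zero]

    def exp_next(values, h, s, a):
        """Expected value at step h+1 when following pi from the next state."""
        return sum(P[s][a][s2] * values[h + 1][s2][pi[h + 1][s2]] for s2 in S)

    def exp_reward(s, a):
        return sum(P[s][a][s2] * R[s][a][s2] for s2 in S)

    Q_pi = [None] * H + [zero]   # value of pi in the original MDP
    Q_err = [None] * H + [zero]  # value of pi in the Bellman error MDP
    gap = 0.0
    for h in reversed(range(H)):
        Q_pi[h] = [[exp_reward(s, a) + gamma * exp_next(Q_pi, h, s, a)
                    for a in A] for s in S]
        # Reward of the Bellman error MDP: Q - B^pi Q.
        r_err = [[Q[h][s][a] - exp_reward(s, a) - gamma * exp_next(Q, h, s, a)
                  for a in A] for s in S]
        Q_err[h] = [[r_err[s][a] + gamma * exp_next(Q_err, h, s, a)
                     for a in A] for s in S]
        gap = max(gap, max(abs(Q_pi[h][s][a] - (Q[h][s][a] - Q_err[h][s][a]))
                           for s in S for a in A))
    return gap
```

Calling `bellman_error_gap()` should return a value at the level of floating-point rounding error, whichever policy and approximate value function the seed selects.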

Corollary 7.2 bounds the range of the Bellman error MDP value function, a property that will
prove very useful in the analysis of our algorithm. It follows from Theorem 7.1 and the fact that
$0 \le Q^{\pi}(x, h) \le Q_{max}$.

Corollary 7.2. Let $0 \le Q(x) \le Q_{max}\ \forall (x)$. Then:

$$-Q_{max} \le Q^{\pi}_{(\pi,Q)}(x) \le Q_{max}\ \ \forall (x).$$

Lemma 7.3 below (a consequence of theorem 7.1) proves that the difference between the expected,
discounted reward of an approximately optimal policy and the greedy policy over $Q$ is bounded above by
the inverse difference of the value of those policies in their respective Bellman error MDPs.

³ Note that this is true for any policy $\pi$, not just the greedy policy over $Q$.
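Lemma 7.3. $\forall (s, m, h, Q)$:

$$V^{\pi^*}(s, m, h) - V^{\pi^Q}(s, m, h) \le V^{\pi^Q}_{(\pi^Q,Q)}(s, m, h) - V^{\pi^*}_{(\pi^*,Q)}(s, m, h)$$

Proof. Let $x_s = s$, $x_m = m$, and $x_h = h$.

$$\int_x P(x \mid \pi^*)\, Q(x) \le \int_x P(x \mid \pi^Q)\, Q(x) \Rightarrow$$
$$\int_x P(x \mid \pi^*) \left( Q^{\pi^*}(x) + Q^{\pi^*}_{(\pi^*,Q)}(x) \right) \le \int_x P(x \mid \pi^Q) \left( Q^{\pi^Q}(x) + Q^{\pi^Q}_{(\pi^Q,Q)}(x) \right) \Rightarrow$$
$$\int_x P(x \mid \pi^*)\, Q^{\pi^*}(x) - \int_x P(x \mid \pi^Q)\, Q^{\pi^Q}(x) \le \int_x P(x \mid \pi^Q)\, Q^{\pi^Q}_{(\pi^Q,Q)}(x) - \int_x P(x \mid \pi^*)\, Q^{\pi^*}_{(\pi^*,Q)}(x) \Rightarrow$$
$$V^{\pi^*}(s, m, h) - V^{\pi^Q}(s, m, h) \le V^{\pi^Q}_{(\pi^Q,Q)}(s, m, h) - V^{\pi^*}_{(\pi^*,Q)}(s, m, h).$$

□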

Lemma 7.4. Let $P_H^{\pi}(s, m, \epsilon_\pi)$ be a policy cover for $\pi$ and $\int_x P(x \mid \pi) \left( Q(x) - B^{\pi}Q(x) \right) \ge -\epsilon_0\ \forall (x) \in P_H^{\pi}(s, m, \epsilon_\pi)$. Then

$$V^{\pi}_{(\pi,Q)}(s, m, h) \ge -\xi \epsilon_0 - \epsilon_\pi$$

where $\xi = H$ if $\gamma = 1$ or $\xi = \min\left\{ H, \frac{1}{1-\gamma} \right\}$ otherwise.

Proof. The expected reward in the Bellman error MDP is bounded below by $-\epsilon_0$ for all state-actions
in the policy cover. Since the expected number of state-actions outside the policy cover encountered
in $H$ steps is bounded above by $\frac{\epsilon_\pi}{Q_{max}}$, and $\int_x P(x \mid \pi) \left( Q(x) - B^{\pi}Q(x) \right) \ge -Q_{max}$, we have that
$V^{\pi}_{(\pi,Q)}(s, m, h) \ge -\xi \epsilon_0 - \epsilon_\pi$. □

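Definition 7.5. The approximate pessimistic Bellman operator is defined as

$$\tilde{B}^{\pi} Q(x) = \frac{\sum_{i=1}^{k} \left[ R(x, x'_{i,s}) + \gamma Q(x'_i) - d(x, x_i) \right]}{k}$$

where $x_i$ is the $i$-th sample returned by $NN_k(x, S)$, $x'_{i,s}$ is the next state predicted for $x$ by the
transfer function $f_{tr}$ from sample $x_i$ and its observed next state, $x'_{i,a} = \pi(x'_{i,s})$,
and $x'_{i,m} = x_m$.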
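
A single application of the operator in Definition 7.5 might be sketched as follows. This is an illustrative reading rather than the authors' implementation: the callables `q`, `policy`, `reward`, `transfer`, and `distance` are hypothetical stand-ins for $Q$, $\pi$, $R$, $f_{tr}$, and $d$, a state-action is represented as a `(state, action)` pair, and the MDP index is omitted for brevity.

```python
def pessimistic_backup(x, neighbors, q, policy, reward, transfer, distance, gamma):
    """Average, over the k neighbors of x, of
    R(x, s') + gamma * Q(s', pi(s')) - d(x, x_i),
    where s' is the next state predicted for x by the transfer function."""
    total = 0.0
    for x_i, observed_next in neighbors:
        s_next = transfer(x_i, observed_next, x)   # predicted next state for x
        x_next = (s_next, policy(s_next))          # next state-action under pi
        total += reward(x, s_next) + gamma * q(x_next) - distance(x, x_i)
    return total / len(neighbors)
```

Subtracting the distance to each neighbor is what makes the estimate pessimistic: the further away the supporting samples are, the lower the backed-up value.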

Lemma 7.6 below bounds the probability that there exists an element in the discretization set with
Bellman error of unacceptably high magnitude.

Lemma 7.6. Let $Q$ be the value function produced by algorithm 1, and k > ln
The probability that there exists at least one $x \in D(S, m, d_Q)$ such that

$$Q(x) - B^{\pi^Q} Q(x) \ge \epsilon_s + \epsilon_c$$

or

$$Q(x) - B^{\pi^*} Q(x) \le -\epsilon_s - \epsilon_c - 2 d_k(x, S, d_{tr})$$

is bounded above by $\delta$.

Proof. Let $Y$ be the set of $k$ samples used to compute the approximate pessimistic Bellman operator
$\tilde{B}^{\pi}Q(x)$ for a fixed $(Q, x)$ and define $f(z_1, \ldots, z_k) = \tilde{B}^{\pi}Q(x)$, where $z_1, \ldots, z_k$ are realizations
of independent (from the Markov property) variables whose outcomes are possible next states, one for
each sample in $Y$. The outcomes of the variables (which is where the Markov property ensures
independence) are the next states the transitions lead to, not the state-action-MDP triples the samples
originate from. The state-action-MDP triples the samples originate from are fixed with respect to $f$,
and no assumptions are made about their distribution. Additionally, $Q(x)$ is fixed with respect to $f$
(we are examining the effects of a single application of $\tilde{B}^{\pi}Q(x)$ to the fixed value $Q(x)$, while
varying the next-states that samples in $Y$ land on). Then:

sup  |/(*1,...  2jfc)  —  f(zu...,Zi-lZi,Zi+l...Zk)\  =Ci< 

zit...zk,2i  * 

485  and  JDi-i  ( °i )2  <  .  From  McDiarmid’s  inequality  we  have 


488  and 


71^  A/ 


A/ 


Q(ar)  -  E  |iT  Q(a?)J  >  €, 

=  p  ( r*  (2i, . . .  2t)  -  e  [r*  (2i, . . .  2*)]  > «, 


< 


2\D(S,m,dQ)\' 


=  p(f 


Q(*)]  < 

(2l,...2t)  -E  |/*  (21,.  ..2fc)]  <  -e,  j 


3*2* 


<e  <  e~ 


2\D(S,m,dQ)\ 

From definition 4.4 we have that $B^{\pi}Q(x) - \epsilon_c - 2 d_k(x, S, d_{tr}) \le E\left[ \tilde{B}^{\pi}Q(x) \right] \le B^{\pi}Q(x) + \epsilon_c$.
From the definition of our algorithm we have that $\tilde{B}^{\pi^*}Q(x) \le \tilde{B}^{\pi^Q}Q(x) = Q(x)$. Substituting
above we have that for a fixed $x$ where $x \in D(S, m, d_Q)$

$$P\left( Q(x) - B^{\pi^Q}Q(x) \ge \epsilon_s + \epsilon_c \right) \le \frac{\delta}{2 |D(S, m, d_Q)|},$$

and

$$P\left( Q(x) - B^{\pi^*}Q(x) \le -\epsilon_s - \epsilon_c - 2 d_k(x, S, d_{tr}) \right) \le \frac{\delta}{2 |D(S, m, d_Q)|}.$$

Taking a union bound over all $x \in D(S, m, d_Q)$ completes our proof. □


Lemma 7.7 extends the result from lemma 7.6 to the entire set of state-actions.

Lemma 7.7. Let $Q$ be the value function produced by algorithm 1, and k >
The probability that there exists at least one $x$ such that

$$Q(x) - B^{\pi^Q}Q(x) \ge \epsilon_s + 2\epsilon_c,$$

or at least one $x$ such that $d(x, D(S, m, d_Q)) \le d_Q$ and

$$Q(x) - B^{\pi^*}Q(x) \le -\epsilon_s - 2\epsilon_c - 2 d_k(x, S, d_{tr})$$

is bounded above by $\delta$.
Proof. If $d(x, D(S, m, d_Q)) > d_Q$ then $Q(x) = 0$. Since $B^{\pi^Q}Q(x) \ge 0$ we have that
$Q(x) - B^{\pi^Q}Q(x) \le 0 < \epsilon_s + 2\epsilon_c$. Otherwise, let $\bar{x}$ be the element of $D(S, m, d_Q)$ nearest to $x$. Then

$$Q(x) - B^{\pi^Q}Q(x) = Q(\bar{x}) - B^{\pi^Q}Q(x) \le Q(\bar{x}) - B^{\pi^Q}Q(\bar{x}) + \epsilon_c,$$

and if $d(x, D(S, m, d_Q)) \le d_Q$

$$Q(x) - B^{\pi^*}Q(x) = Q(\bar{x}) - B^{\pi^*}Q(x) \ge Q(\bar{x}) - B^{\pi^*}Q(\bar{x}) - \epsilon_c,$$

where in both inequalities we used the definition of $Q$ and definition 4.4. Substituting into lemma 7.6
concludes our proof. □


7.1  Proof of theorem 4.6

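Proof. From lemma 7.7 and the fact that $d_k(x, S) \le d_{tr}\ \forall x \in P_H^{\pi^*}(s, m, \epsilon_\pi)$, we have that with
probability at least $1 - \delta$

$$Q(x) - B^{\pi^Q}Q(x) \le \epsilon_s + 2\epsilon_c\ \ \forall x,$$

and

$$Q(x) - B^{\pi^*}Q(x) \ge -\epsilon_s - 2\epsilon_c\ \ \forall (x) \in P_H^{\pi^*}(s, m, \epsilon_\pi).$$

It follows that $\int_x P(x \mid \pi^Q)\, Q^{\pi^Q}_{(\pi^Q,Q)}(x) \le \xi(\epsilon_s + 2\epsilon_c)$ and from lemma 7.4 we have that
$\int_x P(x \mid \pi^*)\, Q^{\pi^*}_{(\pi^*,Q)}(x) \ge -\xi(\epsilon_s + 2\epsilon_c) - \epsilon_\pi$. Substituting into lemma 7.3 we have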

=  J  P(x\n',x„  =  0)Q*"(x)  -  J  P(x |tr«,*h  =  0 )Q”*(x) 

<  J'P(x\*\xh  =  0 )Q*'  (x)  -f  p(* \*<i,xh  =  0 )Q**  (at)  4-  <* 


<  J  P(x\*4,x„  =  0)Q^q)(x)  -  J  P(x\r,xh  =  0)Q*;.,g)(a) 


+  €* 


<  £(*s  4  2 ec)  +  £(es  4  2ec)  4  e* 
=  2£(cs  4  2e„)  4  c*. 
1  Abstract

Probably approximately correct (PAC) algorithms for Reinforcement Learning
have shown strong theoretical guarantees. However, their large sample complexity
has been the bottleneck for practical applications. Transfer learning, on the
other hand, can potentially reduce the sample complexity. In this project, we
introduce intra-task transfer learning into PAC reinforcement learning
algorithms. We demonstrate the sample complexity reduction in a pendulum
experiment.

2  Introduction 

In the last few years, PAC reinforcement learning has gained some popularity for
its theoretical guarantee that, with high probability, the algorithm will produce
a policy that performs almost as well as the optimal policy. The required
sample size is guaranteed to be at most polynomial in the parameters of the
problem. Such parameters include 1) the size of the state-action-MDP space, 2)
the planning horizon, 3) the tolerance of the deviation from the optimal policy,
and 4) the probability of failure.

Previous work showed that providing PAC guarantees requires large sample
complexity in many cases [1] [2] [3]. In RL, the size of the state-action-MDP space
contributes the most to the sample complexity. In this project, we want to
reduce the state-action-MDP space while maintaining the PAC guarantees by
introducing an intra-task transfer function. We want to show that in practice,
this new PAC transfer algorithm can indeed reduce sample complexity.


3  Method 

3.1  Notations

3.1.1  MDP 

A Markov Decision Process family consists of state space S, action space A, MDP
family M, Markov transition model P, reward function R, discount factor γ, and
horizon time H. Q^π(x, h) is defined as the expected, accumulated, discounted
reward from x at step h, when all decisions from step h + 1 follow policy π.

3.1.2  Sample  Set 

Sample  set  S  is  defined  as  the  set  of  all  samples  retained  by  our  algorithm. 

3.1.3  Distance  Function 

d(x, x̄) is defined to be the distance between two different state-action-MDP
triples x and x̄ under some well-defined distance metric.

3.1.4  Cover  Set 

The cover set C(S, m) is defined as the portion of the state-action space where
the distance to at least one sample is less than Q_max.

3.1.5  Intra-task  transfer 

A transfer function f_tr(x, x', x̄) → x̄' is defined as a function that takes in a
state-action-MDP triple x, an associated next state x', and a target state-action-MDP
triple x̄, and predicts the next state x̄' of the target state-action-MDP triple.

3.1.6  Discretization  Set 

A discretization set D(S, m, d_Q) is defined as a discretization of the cover set
C(S, m), such that for every x ∈ C(S, m), there exists x̄ ∈ D(S, m, d_Q) such
that d(x, x̄) ≤ d_Q.

3.2  PAC  Transfer  Algorithm 

The PAC Transfer Algorithm is shown below. It consists of two main stages.
The first stage is to calculate the value function Q using the sample set and the
discretization set, along with the transfer function. The second stage is to use
the calculated Q to select actions.
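
The two stages can be sketched in Python as follows. This is a minimal illustration under several assumptions that the report does not state: state-actions are `(state, action)` pairs, samples are `((state, action), next_state)` pairs, the value of a point is read off its nearest discretization point (or taken to be 0 if none lies within `d_q`), backed-up values are clipped at 0, and the reward does not depend on the step h. All function names are hypothetical.

```python
import math

def nn_k(x, samples, k, dist):
    """NN_k(x, S): the k samples whose origin is nearest to x."""
    return sorted(samples, key=lambda smp: dist(x, smp[0]))[:k]

def lookup(q, disc, x, dist, d_q):
    """Q of the nearest discretization point, or 0 if none lies within d_q
    (assumption)."""
    nearest = min(disc, key=lambda p: dist(x, p))
    return q[nearest] if dist(x, nearest) <= d_q else 0.0

def pac_transfer_q(disc, samples, actions, horizon, k, gamma,
                   dist, f_tr, reward, d_q):
    """Stage 1: H pessimistic backups over the discretization set."""
    q_prev = {x: 0.0 for x in disc}
    for _ in range(horizon):
        q_cur = {}
        for x in disc:
            total = 0.0
            for x_i, next_i in nn_k(x, samples, k, dist):
                s_next = f_tr(x_i, next_i, x)  # predicted next state for x
                v_next = max(lookup(q_prev, disc, (s_next, a), dist, d_q)
                             for a in actions)
                total += reward(x, s_next) + gamma * v_next - dist(x, x_i)
            q_cur[x] = max(0.0, total / k)  # assumption: values are clipped at 0
        q_prev = q_cur
    return q_prev

def greedy_action(q, disc, state, actions, dist, d_q):
    """Stage 2: act greedily with respect to the computed Q."""
    return max(actions, key=lambda a: lookup(q, disc, (state, a), dist, d_q))
```

Passing the transfer function as a plain callable keeps the backup loop unchanged when one transfer function is swapped for another.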

4  Experiment 

4.1  Setups

We consider the following pendulum problem. Suppose we have a pendulum
hanging downwards. We want to apply forces to its attached cart so that we
can bring it as close to the balanced position as possible. The state space then
consists of the vertical angle θ and the angular velocity θ̇. The action space
is {−50, 0, 50}. Uniform noise in [−10, 10) is also added to the actions. The
reward is set to be 1 if |θ| < π/2; it is set to be 0 otherwise. A discount factor
of 0.9 is used. We assumed that the absolute magnitude of the angular velocity
will not exceed 3.


Algorithm 1  PAC Transfer Algorithm
Inputs: MDP m, horizon H, sample set S, number of neighbors k, transfer
    function f_tr, distance function d(), and constant d_Q
Generate discretization set D(S, m, d_Q)
Set Q_cur(x) = 0, Q_prev(x) = 0, ∀x ∈ D(S, m, d_Q)

for h = 1 : H do
    for x ∈ D(S, m, d_Q) do
        Q_cur(x) = (1/k) Σ_{i=1..k} [ R(x, x'_{i,s}, h) + γ Q_prev(x'_i) − d(x, x_i) ]
        where x'_{i,a} = a', x'_{i,m} = m, x'_{i,s} is the next state predicted for x by
        f_tr from x_i and its observed next state, and x_i is the i-th sample
        returned by NN_k(x, S)
    Set Q_prev(x) = Q_cur(x)

while terminal condition unsatisfied do
    Perform action a = argmax_a Q_cur(x)

procedure NN_k(x, S)
    return the k nearest samples to x in S

4.2  Generation  of  the  Sample  Set  and  Discretization  Set 

Since we have a 2-dimensional state space, (θ, θ̇), we want to sample a
Discretization Set of size m × n. We first evenly divide the state space into an m × n
grid, so that cell_ij represents (θ, θ̇), where

θ ∈ [−π/2 + (i − 1) × π/m, −π/2 + i × π/m]

θ̇ ∈ [−3 + (j − 1) × 6/n, −3 + j × 6/n]

We first construct the sample set S by sampling t pairs of (θ, θ̇) from each cell
we just defined. We then construct the discretization set by randomly taking
one of the sampled points from each cell. In this way, for each of the three
actions, we can generate a sample set of size m × n × t, and a corresponding
discretization set of size m × n.
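
The construction above can be sketched as follows for a single action. The sketch assumes that points are drawn uniformly within each cell, which the text does not specify; the function name and return format are hypothetical.

```python
import math
import random

def build_sets(m, n, t, rng):
    """Return (sample_states, discretization_states) for one action.

    The state space [-pi/2, pi/2] x [-3, 3] is split into an m x n grid;
    t states are drawn per cell (uniformly, by assumption) and one of them
    is picked at random as the cell's discretization point."""
    samples, disc = [], []
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            theta_lo = -math.pi / 2 + (i - 1) * math.pi / m
            omega_lo = -3 + (j - 1) * 6 / n
            cell = [(rng.uniform(theta_lo, theta_lo + math.pi / m),
                     rng.uniform(omega_lo, omega_lo + 6 / n))
                    for _ in range(t)]
            samples.extend(cell)
            disc.append(rng.choice(cell))
    return samples, disc
```

Repeating the call once per action yields the three sample sets of size m × n × t and the matching discretization sets of size m × n.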

4.3  Model  Detail 

We set the distance function to be the 2-norm (Euclidean) distance between state
vectors. Note that we set the distance to infinity for pairs with different actions.
We chose the horizon to be 20 since this experiment does not require much long-term
planning. We used 2t nearest neighbors in our model.
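
A distance function of this form might look like the sketch below, assuming each state-action is stored as a `(state, action)` pair with the state given as a tuple (a representation chosen here for illustration).

```python
import math

def distance(x, y):
    """Euclidean distance between states; infinite across different actions."""
    (state_x, action_x), (state_y, action_y) = x, y
    if action_x != action_y:
        return math.inf
    return math.dist(state_x, state_y)
```

Returning infinity for mismatched actions means a nearest-neighbor search never mixes samples collected under different actions while same-action samples exist.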
4.4  Transfer Function Specification

We first did a sanity check using the identity transfer function, that is,
f_tr(x, x', x̄) → x'. We then applied our PAC transfer algorithm with a more
complicated transfer function, the symmetric transfer function. Intuitively, if a
person can perform well on the pendulum balancing task at a state (θ_0, θ̇_0), such
skill should transfer to a new state (−θ_0, −θ̇_0). If we believe that
x̄ is close to −x, then the symmetric transfer function is f_tr(x, x', x̄) → −x'.
The possible existence of symmetry can be checked in the NN_k function in the
algorithm by changing the method of distance calculation. We can then simply
add an indicator variable specifying whether symmetry exists. This indicator
variable is also returned by the NN_k function.
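
The two transfer functions and the symmetry check can be sketched as below. This is one possible reading of the description, restricted to states: how the report treats the action under mirroring is not stated, so actions are left out, and all names are hypothetical.

```python
import math

def identity_transfer(sample_state, sample_next, target_state):
    """Identity transfer: predict the sample's own next state."""
    return sample_next

def symmetric_transfer(sample_state, sample_next, target_state):
    """Symmetric transfer: predict the mirrored next state."""
    return tuple(-v for v in sample_next)

def distance_with_symmetry(target_state, sample_state):
    """Return (distance, mirrored), where mirrored indicates that the sample
    is closer to the target after negating its state."""
    direct = math.dist(target_state, sample_state)
    mirrored = math.dist(target_state, tuple(-v for v in sample_state))
    return (mirrored, True) if mirrored < direct else (direct, False)
```

A neighbor search built on `distance_with_symmetry` could return the flag alongside each sample, so that the backup applies `symmetric_transfer` only to mirrored neighbors.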

4.5  Identity  Transfer  vs.  Symmetric  Transfer 

To quantitatively compare the results from the identity transfer and the symmetric
transfer, we varied the number of samples drawn from each cell for the sample
set, that is, t. For different values of t, we recorded the total number of
iterations the experiment lasted before encountering a failure (|θ| > π/2). For
computational simplicity, we capped the number of iterations at 20,000. We
repeated the experiment ten times. The results are shown in Table 1 and
Fig. 1.

5  Result  and  Conclusion 

From the results, we can observe that, with a fixed number of samples per cell in
the sample set, the PAC Transfer algorithm with the symmetric transfer function
tends to last many more iterations than with the simple identity transfer function.
The variance of the symmetric transfer function result also converges to zero more
quickly, because of the 20,000 iteration cap.

The results match our expectation, since with a symmetric transfer function
the number of samples required to achieve similar performance should be roughly
cut in half.

An example of the heatmap of the learned Q matrix is shown in Fig. 2. The
corresponding best policy is shown in Fig. 3. Note that these two plots are
just for one scenario where we sample 3 pairs per cell with the identity transfer
function.

Above all, we can conclude that the PAC transfer learning algorithms can
indeed reduce sample complexity in some practical applications.


Samples per cell      1      2        3        4        5
Identity Transfer     51     493      5,782    16,045   20,000
Symmetric Transfer    380    12,531   20,000   20,000   20,000

Table 1: Identity vs. Symmetric




Figure  1:  Identity  vs.  Symmetric  Results  with  Error  Bars 


6  Future  Work 

Even though this project shows the practical application of the PAC Transfer
learning algorithms for RL, the experiments we conducted were still fairly
simple. To show that this can be applied to more real-world problems, we
need to design more complicated experiments. Some examples related to the
pendulum experiment include introducing transfer functions for different
pendulum masses and lengths.

Furthermore, although a transfer function, if given, can be very powerful in
terms of reducing the sample complexity, defining a different transfer function for
every problem seems unrealistic. How to learn a general purpose transfer
function that can apply to various problems is worth further study.
Figure 2: A heatmap of the learned Q matrix with 3 samples/cell, identity
Figure 3: A heatmap of the best policy corresponding to Fig. 2
Abstract

Reinforcement  learning  is  concerned  with  how  agents  ought  to  take  actions  in  an 
environment  so  as  to  maximize  cumulative  reward.  Transfer  learning,  on  the  other 
hand,  typically  refers  to  attempts  to  decrease  training  time  by  learning  a  source  task 
before  learning  the  target  task.  There  have  been  many  approaches  to  do  transfer 
learning  within  the  reinforcement  learning  domain.  In  this  paper,  we  present  a  new 
method  of  doing  so  by  using  Generative  Adversarial  Networks. 


1  Introduction 

In reinforcement learning, transferring knowledge gained from tasks solved earlier to solve a new
target task can help, either in terms of speeding up the learning process or in terms of achieving better
final performance. Various methods have been proposed to do transfer given a known relationship
between tasks. Other work, however, focuses on using different tools to autonomously discover such
a relationship and then do transfer. Due to the hardness of this problem, it remains a big challenge to
effectively learn a good relationship between tasks.

In this paper, we propose a novel method that uses Generative Adversarial Networks (GANs) to address
this challenge. The GAN framework is a deep learning model recently introduced by Goodfellow et
al. (2014). It can draw samples from an arbitrary distribution by letting a generator and a discriminator
play a minimax game. It can also model a conditional distribution, and has thus helped to make huge
progress in domain adaptation problems. Inspired by GANs' success in domain adaptation problems, we
aim to learn a mapping between states in two tasks in reinforcement learning and then transfer the
policy accordingly.

2  Related  Work 

In this section, we briefly review advances in transfer reinforcement learning and Generative
Adversarial Networks so that it will be more natural to introduce our proposed method.

2.1  Transfer  Reinforcement  Learning 

As pointed out by Taylor et al. (2009), different transfer algorithms have different problem settings
and make different assumptions. However, they generally share the following steps: 1. Given a target
task, select an appropriate source task(s) to transfer from. 2. Learn how the source task and target
task are related. 3. Effectively transfer knowledge from the source task(s) to the target task. One option
for providing the relationship between the source task and target task is an inter-task mapping. For
example, such a mapping can be decomposed into an action mapping and a state-variable mapping between
two tasks. Since the early 2000s, much work has focused on cases where an inter-task mapping is
available. In other words, that work assumes the second step is provided by a human. More recently, much
work has also tried to learn a mapping between two tasks autonomously. Usually, some type of alignment
method needs to be used to find a good mapping. After the mapping is provided, there are a handful
of ways to do transfer. For example, we can use the Q-values learned from the source task to initialize
the Q-values in the target task. The other option is to transfer directly without a mapping. In deep
reinforcement learning, for instance, we can transfer the weights of a neural network to solve the new
task. This can be viewed as a way to transfer the features. Using some more advanced algorithms, we
can also transfer the policy, and then fine-tune it.

2.2  Generative  Adversarial  Networks 

Generative Adversarial Networks (GANs) have led to significant improvements in image generation
(Radford et al., 2015). The basic idea of GANs is to simultaneously train a discriminator and a
generator. The discriminator is trained to distinguish real samples of a dataset from fake samples
produced by the generator. The generator uses input from an easy-to-sample random source, and
is trained to produce fake samples that the discriminator cannot distinguish from real data samples.
From a game theory point of view, the convergence of a GAN is reached when the generator and
the discriminator reach a Nash equilibrium.
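
For reference, the minimax game described above is usually written as the following value function. It is restated here from Goodfellow et al. (2014) rather than from this report; $p_{data}$ denotes the data distribution and $p_z$ the easy-to-sample noise source feeding the generator $G$, with $D$ the discriminator.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The discriminator maximizes this quantity by assigning high probability to real samples and low probability to generated ones, while the generator minimizes it by producing samples that the discriminator scores as real.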

There are two threads of GAN research moving quickly. One is to make GANs more stable
and powerful. Wasserstein GAN (WGAN) [6], Energy Based GAN (EBGAN) [7] and Boundary
Equilibrium GAN (BEGAN) [8] are recent noteworthy examples on this track. The other track is to
apply GANs to other domains and achieve state-of-the-art results. Recent work on domain transfer is a
great example. Almost concurrently, DiscoGAN [9] and CycleGAN [10] produced astonishing results
on the image-to-image translation task. They can discover paired relationships between
instances from two datasets surprisingly well without supervision. Even though state space is very
different from image space, this raises the question of whether we can apply GANs to do transfer in
reinforcement learning domains.

3  Method 

3.1  Problem  Formulation 

Before we introduce our method, we first formally define our problem. Reinforcement learning
problems are typically formulated as a Markov Decision Process (MDP) $M = \langle S, A, T, r \rangle$, where $S$
is the set of states, $A$ is the set of actions that agents can execute, $T : S \times A \times S \rightarrow [0, 1]$ is a
state transition probability function specifying the task dynamics, and $r : S \times A \times S \rightarrow \mathbb{R}$ is the reward function. A policy
$\pi : S \times A \rightarrow [0, 1]$ is defined as a conditional probability over actions given the state. The goal for
an agent is to sequentially choose actions that maximize its expected return during interaction with
the environment.

Our transfer problem considers a source domain with MDP $M_S = \langle S_S, A_S, T_S, r_S \rangle$ and a target
domain with MDP $M_T = \langle S_T, A_T, T_T, r_T \rangle$. In general, they can have different state spaces,
action spaces, dynamics and reward functions. One way to transfer knowledge from source to target
is to map optimal <state, action, next state> triples into the state and action spaces of the target domain.
To do so, one must provide an inter-task mapping $\chi$ to project such a triple from source to target. Such a
$\chi$ can be decomposed into two sub-mappings: an inter-state mapping $\chi_S$ and an inter-action mapping
$\chi_A$.


3.2  Alignment  with  GAN 

In our study, we want to leverage the strong modeling ability of GANs to learn a potentially very complicated
inter-task mapping. The basic assumption we make is that the dynamics in the source domain
resemble the dynamics in the target domain. In addition, the reward functions should share some
structural similarity. We also assume that the inter-action mapping $\chi_A$ is provided, so we only need to learn
the inter-state mapping $\chi_S$.

In order to learn an inter-task mapping, we need examples from both domains. We obtain trajectories
of states $S_S$ in the source task and trajectories of states $S_T$ in the target task by applying a random
policy in both domains. Because we assume actions are already aligned, we extract state-next state
pairs from both domains as $[(S_S^{(1)}, S_S^{(2)})]$ and $[(S_T^{(1)}, S_T^{(2)})]$ respectively.

We define a function $G(S)$ as a neural network that projects states from the source domain to states in the target
domain. It takes a state from the source task domain as input and outputs a state in the target task
domain. This is the generator in the GAN framework and is implemented by a multilayer perceptron.
We also define a second multilayer perceptron $D((S^{(1)}, S^{(2)}))$ that outputs a single scalar. $D(x)$
represents the probability that the input pair comes from the target domain state-next state pair
distribution rather than from the generator $G$. We train $D$ to maximize the probability of assigning the
correct label to both training examples and samples from $G$. We simultaneously train $G$ to minimize
$\log(1 - D(G(S_S^{(1)}), G(S_S^{(2)})))$.

In other words, $D$ and $G$ play the following two-player minimax game with value function $V(D, G)$:

$\min_G \max_D V(D, G) = \mathbb{E}[\log D(S_T^{(1)}, S_T^{(2)})] + \mathbb{E}[\log(1 - D(G(S_S^{(1)}), G(S_S^{(2)})))]$    (1)

The intuition here is that the generator has to find a projection under which the state-next state
distributions are aligned. Because the generator is defined only on the state space, and the transition
dynamics are similar, we expect this to produce a good alignment across the two domains.
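As a rough illustration of the objective in equation (1), the sketch below estimates the value function from sampled state-next state pairs. The helper names `state_pairs` and `gan_value`, the small `eps` added for numerical safety, and the treatment of `D` and `G` as plain callables are assumptions made for this example; the report does not describe the implementation at this level.

```python
import numpy as np

def state_pairs(trajectory):
    """Turn a trajectory of states (T x dim) into (state, next state) pairs."""
    traj = np.asarray(trajectory, dtype=float)
    return traj[:-1], traj[1:]

def gan_value(D, G, target_pairs, source_pairs, eps=1e-12):
    """Monte Carlo estimate of V(D, G) from equation (1).

    D maps (states, next_states) to probabilities in (0, 1).
    G maps source states to target-domain states.
    eps is a hypothetical guard against log(0).
    """
    t1, t2 = target_pairs
    s1, s2 = source_pairs
    real_term = np.mean(np.log(D(t1, t2) + eps))
    fake_term = np.mean(np.log(1.0 - D(G(s1), G(s2)) + eps))
    return real_term + fake_term
```

In training, the discriminator would take gradient steps that increase this quantity while the generator takes steps that decrease its second term.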

3.3  Using  Alignment  for  Knowledge  Transfer 

After we learn the alignment projector $G$, we can use knowledge from the source task to guide the
training in the target task. We first use a policy gradient method to learn an optimal policy $\pi_S$ in
the source domain. To apply transfer, we then initialize the target policy $\pi_T$ by training only on the transfer reward
$r_{transfer}$:

$r_{transfer} = -\|\pi_S(S_S) - \pi_T(G(S_S))\|_2$    (2)

This term forces the target policy to output the same action as the source policy in the corresponding
source state. Again, we assume here that actions are already properly aligned between the source and target
tasks. We run a standard policy gradient method on this reward to initialize the policy. After that, we
train the target policy on the real target reward without the transfer reward.
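The transfer reward of equation (2) can be sketched as follows, assuming the norm is Euclidean and that both policies return one row of action probabilities per state. The function name `transfer_reward` and the callable interfaces are hypothetical.

```python
import numpy as np

def transfer_reward(pi_source, pi_target, G, source_states):
    """Per-state reward -||pi_S(S_s) - pi_T(G(S_s))||_2 (equation (2))."""
    src = np.asarray(source_states, dtype=float)
    p_src = pi_source(src)       # action probabilities in the source task
    p_tgt = pi_target(G(src))    # target policy evaluated at projected states
    return -np.linalg.norm(p_src - p_tgt, axis=1)
```

The reward is zero when the two policies agree on every projected state and becomes more negative as they diverge.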

The  whole  algorithm  is  summarized  in  the  following: 


Algorithm 1 Transfer by GAN
1: procedure LEARN PROJECTOR
2:     Sample pairs $(S_S^{(1)}, S_S^{(2)})$ from the source task and pairs $(S_T^{(1)}, S_T^{(2)})$ from the target task, both from random policies
3:     Learn projector $G$ by training a GAN with equation (1) as objective using the above samples
4: procedure TRANSFER INITIALIZE POLICY
5:     Sample $m$ source states $S_S$ with optimal policy $\pi_S$
6:     Project $S_S$ to the target domain with projector $G$
7:     Use policy gradient to train on the transfer reward according to equation (2)
8:     Yield initialized target task policy $\pi_T^0$
9: procedure IMPROVE POLICY
10:     Start with $\pi_T^0$ and then train on the real reward
11:     Return optimal target policy $\pi_T^*$


4  Experiment  Result 

We  test  our  transfer  algorithm  on  the  Cart  Pole  environment  as  a  proof  of  concept  experiment.  Here, 
we  are  trying  to  transfer  knowledge  between  Cart  Pole  with  different  parameters. 

The goal of Cart Pole (Figure 1) is to swing up and then balance the pole vertically. The system
dynamics are described via a four-dimensional state vector $\langle x, \dot{x}, \theta, \dot{\theta} \rangle$, representing the
position and velocity of the cart and the angle and angular velocity of the pole, respectively. The actions apply a force of +1 or -1
to the cart. Notice that the actions of Cart Pole with different parameters are naturally aligned.

Figure 1: Cart Pole Environment. Figure 2: Transfer Result.


Figure 2 shows the average target task reward over iterations with and without transfer. For the transfer
case, the curve is plotted after the policy has been initialized by transfer. This transfer policy initialization phase
was trained with only 50 optimal trajectories for 1 iteration. The transfer-initialized policy
outperforms a standard policy gradient trained from scratch: it has much better initial performance
and converges faster. This demonstrates that our algorithm is capable of providing a helpful target
policy initialization.

We applied a few tricks in the implementation of our model. Because the training of GANs is well known
to be unstable, we add a penalty term $\|G(S) - S\|_2$ that encourages the transformed
states to stay close to the original input, which avoids a bad initialization by a random projection. We gradually
decay the weight of this term during training so that the distributions can be matched better. We also
select the result from saved training checkpoints based on how close the projected next state is to the
actual next state in the target task.
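A minimal sketch of the decayed penalty term is given below. The report does not state the decay schedule, so the exponential schedule, the default values of `initial_weight` and `decay_rate`, and all function names are assumptions chosen only for illustration.

```python
import numpy as np

def identity_penalty(G, states):
    """Mean of ||G(S) - S||_2 over a batch of source states."""
    s = np.asarray(states, dtype=float)
    return float(np.mean(np.linalg.norm(G(s) - s, axis=1)))

def penalty_weight(step, initial_weight=1.0, decay_rate=0.99):
    """Exponentially decayed weight; the schedule itself is an assumption."""
    return initial_weight * decay_rate ** step

def generator_loss(adversarial_loss, G, states, step):
    """Adversarial loss plus the decayed identity penalty."""
    return adversarial_loss + penalty_weight(step) * identity_penalty(G, states)
```

Early in training the penalty dominates and keeps the projector near the identity; as the weight shrinks, the adversarial term is free to reshape the mapping.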

5  Conclusion 

We introduced a technique for autonomous transfer with policy gradient reinforcement learning.
Our approach employs Generative Adversarial Networks to generate an inter-task mapping, which
is then used to transfer source knowledge to the target domain. We
demonstrated its effectiveness on the Cart Pole environment, showing that it can improve the agent's
initial performance and convergence speed.

1  Abstract

Probably approximately correct (PAC) algorithms for reinforcement learning
have strong theoretical guarantees. However, their large sample complexity
has been the bottleneck for practical applications. Manifold alignment, on
the other hand, can effectively transfer knowledge and potentially reduce the
sample complexity. In this project, we introduce intra-task transfer learning
into PAC reinforcement learning algorithms. We explore sample transfer
through manifold alignment. We demonstrate the sample complexity reduction
on several classic control experiments with a perfect transfer function and a
learned transfer function.


2  Introduction 

In the last few years, PAC reinforcement learning has gained some popularity for
its theoretical guarantee that, with high probability, the algorithm will produce
a policy that performs almost as well as the optimal policy. The required
sample size is guaranteed to be at most polynomial in the parameters of the
problem. Such parameters include 1) the size of the state-action-MDP space, 2)
the planning horizon, 3) the tolerance of the deviation from the optimal policy,
and 4) the probability of failure.

Previous works showed that providing PAC guarantees requires large sample
complexity in many cases [2] [3] [4]. In RL, the state-action-MDP space
contributes the most to the sample complexity. It is natural to explore the
possibility of knowledge transfer to reduce sample complexity. One idea is that
samples in the source domain and the target domain may have some connections in
their underlying manifold structures, and manifold alignment is introduced to
align their manifolds to achieve knowledge transfer [5]. A special case of this
technique, unsupervised manifold alignment, has been shown to be effective in
inter-task transfer in the context of policy gradient RL [1]. In this project, we first
introduce a PAC Transfer Algorithm, in which we demonstrate that an intra-task
transfer function can reduce the state-action-MDP sample complexity while
maintaining the PAC guarantees. We then show that manifold alignment can
be used to learn such a transfer function and effectively reduce sample complexity
in the context of the PAC transfer algorithm.


3  Method 

3.1  PAC  Transfer  Algorithm 

3.1.1  Notations 

MDP: A Markov Decision Process family consists of state space $S$, action space
$A$, MDP family $M$, Markov transition model $P$, reward function $R$, discount
factor $\gamma$, and horizon $H$. $Q^{\pi}(x, h)$ is defined as the expected,
accumulated, discounted reward from $x$ at step $h$, when all decisions from step
$h + 1$ follow policy $\pi$.

Sample  Set:  Sample  set  S  is  defined  as  the  set  of  all  samples  retained  by  our 
algorithm. 

Distance Function: $d(x, \bar{x})$ is defined to be the distance between two different
state-action-MDP triples under some well-defined distance metric.

Cover Set: The cover set $C(S, m)$ is defined as the portion of the
state-action space where at least one sample is less than $Q_{max}$.

Intra-task transfer: A transfer function $f_{tr}(x, x'_s, \bar{x}) \rightarrow \bar{x}'_s$ is defined as
a function that takes in a state-action-MDP triple, an associated next state, and
a target state-action-MDP triple, and predicts the next state of the target
state-action-MDP triple.

Discretization Set: A discretization set $D(S, m, d_Q)$ is defined as a
discretization of the cover set $C(S, m)$, such that for every $x \in C(S, m)$, there
exists $\bar{x} \in D(S, m, d_Q)$ such that $d(x, \bar{x}) < d_Q$.

3.1.2  Detailed  Algorithm 

The PAC Transfer Algorithm is shown below (Algorithm 1). It consists of three main steps.
The first step is to derive a sample set $S$ and a discretization set $D$. The
second step is to calculate the value function $Q$ using the sample set and the
discretization set, along with the transfer function. The last step is to act
according to the policy induced by the calculated $Q$.
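The second step, the value backup, can be sketched roughly as follows for a single MDP. This is a simplified reading of the update in Algorithm 1, not the authors' implementation: averaging over the k neighbours, looking up the previous value at the nearest discretization point, and representing each point as a `(state, action)` tuple are all assumptions, as are the names `dist`, `nearest_index` and `backup`.

```python
import numpy as np

def dist(x, y):
    """Euclidean distance between states; infinite if the actions differ."""
    (s1, a1), (s2, a2) = x, y
    return float(np.linalg.norm(np.subtract(s1, s2))) if a1 == a2 else np.inf

def nearest_index(x, points, dist_fn):
    """Index of the point closest to x under dist_fn."""
    return int(np.argmin([dist_fn(x, p) for p in points]))

def backup(disc, samples, actions, reward, f_tr, dist_fn, q_prev, gamma, k):
    """One sweep of the value update over the discretization set.

    disc    : list of (state, action) points
    samples : list of ((state, action), next_state) tuples
    q_prev  : array with one value per point in disc
    """
    q_cur = np.zeros(len(disc))
    for idx, x in enumerate(disc):
        # NN_k(x, S): the k samples closest to x.
        neighbours = sorted(samples, key=lambda smp: dist_fn(x, smp[0]))[:k]
        total = 0.0
        for x_i, next_s in neighbours:
            moved = f_tr(x_i, next_s, x)  # transferred next state
            best = max(
                reward(x_i, moved, x)
                + gamma * q_prev[nearest_index((moved, a), disc, dist_fn)]
                for a in actions
            )
            # Penalize distant samples and clip the estimate at zero.
            total += max(0.0, best - dist_fn(x, x_i))
        q_cur[idx] = total / len(neighbours)
    return q_cur
```

Calling `backup` H times, feeding each result back in as `q_prev`, mirrors the outer loop over the horizon.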

3.2  Manifold  Alignment

3.2.1  Problem  Definition 

We first define the problem in the context of RL state mapping. More
specifically, let $X_s$ be a sample set of states from the source domain, where
$\mathcal{X}_s$ is the manifold of $X_s$. Let $X_t$ be a sample set of states from the target
domain, where $\mathcal{X}_t$ is the manifold of $X_t$. We have partial knowledge about
their correspondence, $x_s^i \in X_s \leftrightarrow x_t^j \in X_t$. We want to map $X_s$ and $X_t$ to a
new space, modeling their correspondence and preserving the local structure of
each set at the same time. In our specific context, we choose to use feature-level
alignment since it generalizes more easily to new instances and better fits our
application.

Algorithm 1  PAC Transfer Algorithm
1: Inputs: MDP $m$, horizon $H$, sample set $S$, number of neighbors $k$, transfer function $f_{tr}$, distance function $d$, and constant $d_Q$
2: Generate discretization set $D(S, m, d_Q)$
3: Set $Q_{cur}(x) = 0$, $Q_{prev}(x) = 0$, $\forall x \in D(S, m, d_Q)$
4:
5: for $h = 1 : H$ do
6:     for $x \in D(S, m, d_Q)$ do
7:         $Q_{cur}(x) = \frac{1}{k} \sum_{i=1}^{k} \max\{0, \max_{a'} \{R(x_i, x'_{i,s}, x) + \gamma Q_{prev}(x'_i) - d(x, x_i)\}\}$
8:         where $x'_{i,a} = a'$, $x'_{i,m} = m$, $x'_{i,s} = f_{tr}(x_i, x'_{i,s}, x)$, and $(x_i, x'_{i,s})$ is the $i$th sample returned by $NN_k(x, S)$
9:     Set $Q_{prev}(x) = Q_{cur}(x)$
10:
11: while terminal condition unsatisfied do
12:     Perform action $= \arg\max_{x_a} Q_{cur}(x)$
13:
14: procedure $NN_k(x, S)$
15:     return the $k$ nearest samples to $x$ in $S$

3.2.2  Notations 

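$X_s$: $p_s \times m_s$ matrix, representing the source domain sample set
$X_t$: $p_t \times m_t$ matrix, representing the target domain sample set
$W_s$: $m_s \times m_s$ relationship matrix of the source domain sample set; $W_s^{i,j}$ is the similarity of $x_s^i$ and $x_s^j$ (heat kernel)
$W_t$: $m_t \times m_t$ relationship matrix of the target domain sample set; $W_t^{i,j}$ is the similarity of $x_t^i$ and $x_t^j$, $W_t^{i,j} = e^{-\mathrm{dist}(x_t^i, x_t^j)/\delta}$
$D_s$: $m_s \times m_s$ diagonal matrix, where $D_s^{i,i} = \sum_j W_s^{i,j}$
$D_t$: $m_t \times m_t$ diagonal matrix, where $D_t^{i,i} = \sum_j W_t^{i,j}$
$L_s$: $L_s = D_s - W_s$
$L_t$: $L_t = D_t - W_t$
$W_{s,t}$: $m_s \times m_t$ correspondence matrix; $W_{s,t}^{i,j} = 1$ if $x_s^i$ corresponds with $x_t^j$; otherwise, $W_{s,t}^{i,j} = 0$
$\Omega_s$: $m_s \times m_s$ diagonal matrix, $\Omega_s^{i,i} = \sum_j W_{s,t}^{i,j}$
$\Omega_t$: $m_t \times m_t$ diagonal matrix, $\Omega_t^{j,j} = \sum_i W_{s,t}^{i,j}$
$Z$: $Z = \begin{pmatrix} X_s & 0 \\ 0 & X_t \end{pmatrix}$
$D$: $D = \begin{pmatrix} D_s & 0 \\ 0 & D_t \end{pmatrix}$
$L$: $L = \begin{pmatrix} L_s + \mu \Omega_s & -\mu W_{s,t} \\ -\mu W_{s,t}^T & L_t + \mu \Omega_t \end{pmatrix}$
$F_s$: a $p_s \times d$ matrix that maps each $x_s^i$ to a new $d$-dimensional space
$F_t$: a $p_t \times d$ matrix that maps each $x_t^j$ to a new $d$-dimensional space
$\gamma$: $\gamma = (F_s^T, F_t^T)^T$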
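To make the relationship, degree and Laplacian matrices concrete, here is a small sketch that builds them for one domain. It assumes the heat kernel uses squared Euclidean distance, which the notation above leaves unspecified; the function names are hypothetical.

```python
import numpy as np

def heat_kernel_weights(X, delta=1.0):
    """W[i, j] = exp(-||x_i - x_j||^2 / delta) for the columns x_i of X (p x m).

    Squared Euclidean distance is an assumption of this sketch.
    """
    diff = X[:, :, None] - X[:, None, :]   # p x m x m pairwise differences
    sq_dist = np.sum(diff ** 2, axis=0)
    return np.exp(-sq_dist / delta)

def graph_laplacian(W):
    """Return (D, L) where D holds the row sums of W and L = D - W."""
    D = np.diag(W.sum(axis=1))
    return D, D - W
```

Applying these two functions to the source samples and to the target samples gives the per-domain relationship, degree and Laplacian matrices.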


3.2.3  Cost  function 

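The cost function is defined as follows:

$$C(F_s, F_t) = \mu \sum_{i=1}^{m_s} \sum_{j=1}^{m_t} \|F_s^T x_s^i - F_t^T x_t^j\|^2 W_{s,t}^{i,j} + 0.5 \sum_{i=1}^{m_s} \sum_{j=1}^{m_s} \|F_s^T x_s^i - F_s^T x_s^j\|^2 W_s^{i,j} + 0.5 \sum_{i=1}^{m_t} \sum_{j=1}^{m_t} \|F_t^T x_t^i - F_t^T x_t^j\|^2 W_t^{i,j}$$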

This  cost  function  represents  a  trade-off  between  preserving  local  geometry  of 
each  data  set  and  capturing  their  correspondence. 
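The trade-off can be seen directly by evaluating the cost for given mappings. The sketch below assumes samples are stored as matrix columns and that embedded samples are obtained as the transposed mapping applied to each column; the function name `alignment_cost` is hypothetical.

```python
import numpy as np

def alignment_cost(Fs, Ft, Xs, Xt, Ws, Wt, Wst, mu):
    """Correspondence term plus the two local-geometry terms of the cost."""
    Ys = Fs.T @ Xs   # d x m_s embedded source samples
    Yt = Ft.T @ Xt   # d x m_t embedded target samples

    def pair_sq(A, B):
        # Squared distances between every column of A and every column of B.
        diff = A[:, :, None] - B[:, None, :]
        return np.sum(diff ** 2, axis=0)

    correspondence = mu * np.sum(pair_sq(Ys, Yt) * Wst)
    source_geometry = 0.5 * np.sum(pair_sq(Ys, Ys) * Ws)
    target_geometry = 0.5 * np.sum(pair_sq(Yt, Yt) * Wt)
    return correspondence + source_geometry + target_geometry
```

A larger weight on the correspondence term pulls matched pairs together at the expense of distorting each domain's local neighbourhood structure.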

3.2.4  Manifold  Alignment  Algorithm  [5] 

♦ Construct matrices $W_s$ and $W_t$ to capture the local geometry of each
data set. Construct matrix $W_{s,t}$ to model the correspondence between
the two data sets.

♦ Compute $L$, $Z$ and $D$ as indicated above to model the joint structure of
the two data sets.

♦ Derive the optimal mapping function from the $d$ minimum eigenvectors
$\gamma_1, \gamma_2, \ldots, \gamma_d$ of the generalized eigenvalue decomposition

$Z L Z^T \gamma = \lambda Z D Z^T \gamma$

♦ Let $F_s$ be the part of $[\gamma_1, \gamma_2, \ldots, \gamma_d]$ from row 1 to row $p_s$. Let $F_t$ be the part of
$[\gamma_1, \gamma_2, \ldots, \gamma_d]$ from row $p_s + 1$ to row $p_s + p_t$. Then $F_s^T x_s^i$ and $F_t^T x_t^j$ are in the
same space and can be aligned.
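The steps above can be sketched with NumPy as follows. Solving the generalized eigenproblem by inverting the right-hand matrix is a simplification that assumes it is invertible; a dedicated symmetric generalized eigensolver would be the more robust choice. The block layout of the joint matrices follows the standard construction from [5], and the function names are hypothetical.

```python
import numpy as np

def joint_matrices(Xs, Xt, Ws, Wt, Wst, mu):
    """Return (Z L Z^T, Z D Z^T) for the joint structure of both data sets."""
    (ps, ms), (pt, mt) = Xs.shape, Xt.shape
    Ds, Dt = np.diag(Ws.sum(axis=1)), np.diag(Wt.sum(axis=1))
    Om_s, Om_t = np.diag(Wst.sum(axis=1)), np.diag(Wst.sum(axis=0))
    L = np.block([[Ds - Ws + mu * Om_s, -mu * Wst],
                  [-mu * Wst.T, Dt - Wt + mu * Om_t]])
    D = np.block([[Ds, np.zeros((ms, mt))], [np.zeros((mt, ms)), Dt]])
    Z = np.block([[Xs, np.zeros((ps, mt))], [np.zeros((pt, ms)), Xt]])
    return Z @ L @ Z.T, Z @ D @ Z.T

def manifold_alignment(Xs, Xt, Ws, Wt, Wst, mu, d):
    """Return (Fs, Ft, eigenvalues) from the d smallest generalized eigenpairs."""
    A, B = joint_matrices(Xs, Xt, Ws, Wt, Wst, mu)
    # A g = lambda B g rewritten as (B^-1 A) g = lambda g; assumes B is invertible.
    vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
    order = np.argsort(vals.real)[:d]
    gamma = vecs[:, order].real
    ps = Xs.shape[0]
    return gamma[:ps], gamma[ps:], vals.real[order]
```

The first block of rows of the eigenvector matrix gives the source mapping and the remaining rows give the target mapping.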

3.2.5  Discussion  on  Correspondence 

Full Correspondence: If the correspondence matrix $W_{s,t}$ captures full
correspondence information of the two data sets, then the problem is called
supervised manifold alignment.

Partial Correspondence: If the correspondence matrix $W_{s,t}$ captures
partial correspondence information of the two data sets, then the problem is
called semi-supervised manifold alignment.

No Correspondence: If there is no correspondence information at all, then
the problem is called unsupervised manifold alignment. Since $x_s^i$ and $x_t^j$
cannot be directly compared, we can instead use $x_s^i$ and its neighbors to
capture the local geometry of $x_s^i$. We can characterize the local geometry of $x_t^j$
in a similar way [6]. Then we can compare the local relations to find
correspondences and thus generate the correspondence matrix $W_{s,t}$.

4  Experiment

4.1  Pendulum 

We consider the following pendulum balance problem. Suppose we have an
inverted pendulum, pointing upwards. We want to apply forces to its attached cart so
that we bring it as close to the balanced position as possible. The state
space consists of the vertical angle $\theta$ and the angular velocity $\dot{\theta}$. We
assume that the angle is in the range $[-\pi/2, \pi/2]$ and the angular velocity is
in the range $[-3, 3]$. The action space is $\{-50, 0, 50\}$. Uniform noise in
$[-10, 10]$ is also added to the actions. The reward is 1 if $|\theta| < \pi/2$ and
0 otherwise. A discount factor of 0.9 is used.

Since we have a 2-dimensional state space, $(\theta, \dot{\theta})$, we want to sample a
discretization set of size $m \times n$ (we choose $m = 30$, $n = 30$). We first evenly
divide the state space into an $m \times n$ grid, so that $cell_{ij}$ represents $(\theta, \dot{\theta})$, where

$\theta \in [-\pi/2 + (i - 1) \times \pi/m, \; -\pi/2 + i \times \pi/m]$

$\dot{\theta} \in [-3 + (j - 1) \times 6/n, \; -3 + j \times 6/n]$

We first construct the sample set $S$ by sampling $t$ pairs of $(\theta, \dot{\theta})$ from each cell
we just defined. We then construct the discretization set by randomly taking
one of the sampled points from each cell. In this way, for each of the three
actions, we can generate a sample set of size $m \times n \times t$ and a corresponding
discretization set of size $m \times n$.
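The construction of the two sets for one action can be sketched as below. Sampling uniformly within each cell, the default `t=5`, and the function name `build_sets` are assumptions; the report does not state how points are drawn inside a cell.

```python
import numpy as np

def build_sets(m=30, n=30, t=5, seed=0):
    """Return (sample_set, discretization_set) for the pendulum state grid.

    sample_set has m*n*t rows and discretization_set has m*n rows,
    each row being (angle, angular velocity).
    """
    rng = np.random.default_rng(seed)
    theta_edges = np.linspace(-np.pi / 2, np.pi / 2, m + 1)
    omega_edges = np.linspace(-3.0, 3.0, n + 1)
    samples = np.empty((m, n, t, 2))
    for i in range(m):
        for j in range(n):
            # Draw t points inside cell (i, j); uniform sampling is assumed.
            samples[i, j, :, 0] = rng.uniform(theta_edges[i], theta_edges[i + 1], t)
            samples[i, j, :, 1] = rng.uniform(omega_edges[j], omega_edges[j + 1], t)
    # Pick one of the t sampled points per cell for the discretization set.
    picks = rng.integers(0, t, size=(m, n))
    disc = samples[np.arange(m)[:, None], np.arange(n)[None, :], picks]
    return samples.reshape(-1, 2), disc.reshape(-1, 2)
```

Repeating the call once per action yields the three sample sets and discretization sets described above.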

We set the distance function to be the Euclidean (2-norm) distance between state vectors.
Note that we set the distance to infinity for pairs with different actions. We
choose the horizon $H = 20$ and use $k = 10$ nearest neighbors in our model.

4.2  Pendulum  Swing  Up 

Similar to the Pendulum setup, we now have a pendulum hanging downwards.
We want to apply forces to its attached cart to bring it up to a balanced
position as quickly as possible. We assume that the angle is in the range
$[-\pi, \pi]$ and the angular velocity is in the range $[-10, 10]$. The discretization
set is of size $20 \times 60$. The terminal state is any of the four cells in the center of
the discretization set (both $\theta$ and $\dot{\theta}$ are close to zero). The reward is
1 if the pendulum reaches the terminal state and 0 otherwise. All other
settings are the same as in the pendulum experiment.

4.3  Transfer  Function  Specification 

4.3.1  Identity  Transfer 

We first conduct a sanity check using the identity transfer function, that is,
$f_{tr}(x, x'_s, \bar{x}) \rightarrow x'_s$. This demonstrates the baseline performance of the
PAC RL algorithm when a transfer function is not involved.

4.3.2  Symmetry  Transfer


We then applied our PAC transfer algorithm with a slightly more complicated
transfer function, the symmetry transfer function. Intuitively, if a person can
perform well on the pendulum balancing task at a state $(\theta_0, \dot{\theta}_0)$, that skill should
transfer to a new state $(-\theta_0, -\dot{\theta}_0)$. The symmetric
transfer function is formally defined as $f_{tr}(x, x'_s, \bar{x}) \rightarrow -x'_s$ if $x$ and $\bar{x}$ are
symmetric. The possible existence of symmetry can be checked in the $NN_k$
function of the algorithm by changing the method of distance calculation. We
can then simply add an indicator variable specifying whether symmetry exists.
This indicator variable is also returned by the $NN_k$ function.
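One way to read this is sketched below: the distance routine also measures the distance to the mirrored state and reports which of the two was closer. Mirroring the action together with the state is an assumption based on the symmetric action set of the pendulum, and both function names are hypothetical.

```python
import numpy as np

def symmetric_distance(x, y):
    """Return (distance, is_mirrored) for state-action pairs x and y."""
    (s1, a1), (s2, a2) = x, y
    s1, s2 = np.asarray(s1, dtype=float), np.asarray(s2, dtype=float)
    # Ordinary distance requires equal actions.
    direct = np.linalg.norm(s1 - s2) if a1 == a2 else np.inf
    # Mirrored distance compares s1 with -s2 and requires opposite actions (assumed).
    mirrored = np.linalg.norm(s1 + s2) if a1 == -a2 else np.inf
    if mirrored < direct:
        return float(mirrored), True
    return float(direct), False

def symmetry_transfer(next_state, is_mirrored):
    """Negate the sample's next state when the sample was matched by symmetry."""
    nxt = np.asarray(next_state, dtype=float)
    return -nxt if is_mirrored else nxt
```

With this distance, a sample observed at one state can serve as a neighbour for the mirrored state, which roughly doubles the useful coverage of each sample.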

4.3.3  Semi-supervised  Manifold  Alignment 

We then use the semi-supervised manifold alignment technique to learn such a
symmetry transfer function. We assume that we have perfect knowledge about the
action transfer. We randomly select 20% of the source domain state sample as
the target domain state sample. We then label the correspondence of 80% of
the target domain state sample. The sensitivity to the amount of target domain
data and correspondence pairs is discussed in the result section. The transfer
function is formally defined as $f_{tr}(x, x'_s, \bar{x}) \rightarrow x'_t$. We choose $\delta = 1$ in the heat
kernel of the relationship matrices $W_s$ and $W_t$. We use $\mu = 2$ in the manifold
alignment cost function. We also set the dimension of the new space to 2
so that the mapping matrices $F_s$ and $F_t$ are square, which avoids
reconstruction error. For a new instance $x_s$ in the source domain, we can then compute
its corresponding instance $x_t$ in the target domain as $x_t = F_t^{-1} F_s x_s$.
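A small sketch of this final mapping step follows. It uses the embedding convention of Section 3.2.4, where a sample is embedded by the transposed mapping matrix, so the inversion is applied to the transpose of the target mapping; the function name is hypothetical.

```python
import numpy as np

def map_source_to_target(Fs, Ft, x_s):
    """Find x_t whose target embedding equals the source embedding of x_s.

    Solves Ft^T x_t = Fs^T x_s, assuming Fs and Ft are square and Ft is invertible.
    """
    return np.linalg.solve(Ft.T, Fs.T @ np.asarray(x_s, dtype=float))
```

If the learned mappings satisfy Ft = -Fs, this recovers the symmetry transfer exactly, since every source state is sent to its negation.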

4.3.4  Unsupervised  Manifold  Alignment 

For unsupervised manifold alignment, we again assume that we have perfect
knowledge about the action transfer. We select 70% of the source domain state
sample as the target domain state sample. We then label the correspondence
of 80% of the target domain state sample. The sensitivity to the amount of
target domain data and correspondence pairs is discussed in the result section.
We use $k = 10$ nearest neighbours to generate the correspondence matrix $W_{s,t}$.
All other parameters are kept the same as in semi-supervised manifold
alignment.

5  Result  and  Discussion 

To quantitatively compare the transfer functions mentioned above,
we vary the number of samples in each cell of the sample set $S$ for
both experiments. We run each experiment 20 times for each number of samples
per cell and record the median number of iterations the experiment lasts before
it terminates. Note that for both experiments, we set the maximum number of iterations
to 5,000. The results are shown in Fig. 1.

Figure 1: Comparison of Different Transfer Functions. (a) Pendulum Experiment. (b) Pendulum Swing-up Experiment.


We can observe that, with a fixed number of samples per cell in the sample
set, the PAC Transfer algorithm with a perfect symmetric transfer function tends to
perform better than with a simple identity transfer function. A transfer function
learned via semi-supervised manifold alignment generally performs slightly worse
than the perfect transfer function but still better than the identity transfer
function. For unsupervised manifold alignment, it is much more difficult to
learn such a transfer function, and its learning performance has quite large
variance. It generally performs similarly to the identity transfer function and
can sometimes be worse. We also observe considerable variation between runs
with a fixed number of samples per cell, for all four transfer functions. This is
mostly caused by the noise that we add in our simulation.

We also test the influence of the amount of target domain data and
correspondence information on the performance of both semi-supervised and
unsupervised manifold alignment (see Fig. 2). With a given source domain
data set $X_s$ (we use samples from the pendulum experiment), we use $-X_s$
as the ground-truth label of the symmetry transfer function. For both
manifold alignment methods, two mappings, $F_s$ and $F_t$, are obtained. We apply
these two mappings to all source domain data $X_s$. We derive the testing
result $X_{test} = F_t^{-1} F_s X_s$. The loss function is simply defined as the Euclidean
distance between each instance of $X_{test}$ and $-X_s$. We use the average loss
to evaluate the performance. We can observe that for both manifold
alignment algorithms, more data always provides better performance. We notice that
semi-supervised manifold alignment can perform reasonably well with a fairly
small amount of data, whereas unsupervised manifold alignment
requires much more data to reach similar performance. We also observe that
more correspondence information has a positive impact on the performance of
semi-supervised manifold alignment.

Figure 2: Sensitivity Test on Manifold Alignment. (a) The Influence of the Amount of Target Domain Data on Semi-supervised Manifold Alignment. (b) The Influence of the Amount of Correspondence on Semi-supervised Manifold Alignment. (c) The Influence of the Amount of Target Domain Data on Unsupervised Manifold Alignment.
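The average loss used in this sensitivity test can be written in a few lines. The sketch assumes samples are stored as matrix columns and that the mapped test result has already been computed; the function name is hypothetical.

```python
import numpy as np

def average_symmetry_loss(X_test, Xs):
    """Mean Euclidean distance between the columns of X_test and the labels -Xs."""
    X_test = np.asarray(X_test, dtype=float)
    Xs = np.asarray(Xs, dtype=float)
    # X_test - (-Xs) = X_test + Xs, measured column by column.
    return float(np.mean(np.linalg.norm(X_test + Xs, axis=0)))
```

A loss of zero means the learned mappings reproduce the symmetry transfer exactly on the source samples.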


6  Future  Work 

We can conclude that introducing a transfer function into the PAC RL
algorithm can indeed reduce sample complexity in some practical applications.
Manifold alignment shows some promise for learning such a transfer function.

However, the experiments we conducted are still fairly simple. To show that
this can be applied to more real-world problems, we need to design more
complicated experiments. It would also be interesting to explore transfer
learning without correspondence information further, since our experiments show
that the unsupervised manifold alignment method is somewhat unreliable.
Another problem is that the PAC algorithms still do not scale well even with
a plausible transfer function. Further studies on reducing sample
complexity, such as learning through demonstration, would be helpful.

List  of  Acronyms


MDP  -  Markov  Decision  Process 
PAC  -  Probably  Approximately  Correct 
GAN  -  Generative  Adversarial  Network 
UCB  -  Upper  Confidence  Bound